Cancer genomes frequently undergo genomic instability, resulting in the accumulation of chromosomal rearrangements. To date, one of the main challenges has been to confidently and accurately identify these rearrangements using short-read massively parallel sequencing. We were able to improve cancer rearrangement detection by combining two distinct massively parallel sequencing strategies: fosmid-sized (36 Kilobases on average) and standard 5 Kilobase mate pair libraries. We applied this strategy to map rearrangements in two breast cancer cell lines, MCF7 and HCC1954. We detect and validate a total of 91 somatic rearrangements in MCF7 and 25 in HCC1954, including genomic alterations corresponding to previously reported transcript aberrations in these two cell lines. Each of the genomes contains two types of breakpoints: clustered and dispersed. In both cell lines, the dispersed breakpoints show enrichment for low copy repeats, while the clustered breakpoints associate with high-copy number amplifications. Comparing the two genomes, we observe highly similar structural mutational spectra affecting different sets of genes, pointing to similar histories of genomic instability against the background of very different gene network perturbations.
End sequence profiling of clonal libraries has been used extensively to discover structural variation in both normal and cancer genomes [1–5]. Recently, the adoption of massively parallel sequencing has supplemented structural variation detection by identifying rearrangements at fine-scale resolution for both normal and cancer genomes [6–14]. These massively parallel sequencing studies have significantly added to the catalog of genomic rearrangements, but the limited insert sizes between paired ends have provided less power than larger insert clones to map across duplications and repeat-rich regions in the genome, thereby missing a large fraction of variation [2, 15]. Massively parallel mate pair sequencing is also hindered by high false positive rearrangement discovery rates, requiring additional breakpoint validation. Commonly employed techniques for breakpoint validation include optical mapping based on restriction enzyme maps or incorporated fluorochrome-labelled nucleotides [2, 16], hybridization of fluorescent probes that span rearrangements [5, 17], and polymerase chain reaction amplification across aberrant fusions followed by Sanger-based sequencing of the breakpoint amplicons [1, 7, 11, 12, 14]. Although these validation techniques offer proof of genomic rearrangement, none are currently amenable to high-throughput workflows.
We supplement the limited insert size of standard massively parallel mate pair sequencing by incorporating fosmid-sized insert libraries, thereby providing additional validation of detected rearrangements. Our fosmid-sized mate pair libraries, called fosmid diTags, leverage the affordable costs and high-throughput capacity of massively parallel sequencing while providing clone-sized inserts able to span long repetitive sequence elements. Fosmid diTags are well-suited for rearrangement detection either in stand-alone or complementary fashion with other mate pair libraries; fosmid diTags are also advantageous for de novo genome assembly, where larger insert size facilitates greater continuity. Fosmid diTags are an extension of paired end tag methods [14, 19–21], where short paired tags from the ends of DNA fragments are enzymatically extracted and covalently linked as ditag constructs for high-throughput sequencing; see Supplemental Figures 1 and 2 for fosmid diTag workflow details.
Paired end sequencing methods exploit the fact that structural abnormalities consist of two chromosomal segments that are in a relative position, orientation, or at a relative distance that is not consistent with the reference genome assembly. Construction of paired end sequencing libraries that adequately cover the genome of interest allows for comprehensive identification of structural abnormalities.
A total of 1.55 million MCF7 (ATCC HTB-22) and 1.50 million HCC1954 (ATCC CRL-2338) fosmids were cloned using the novel pFosDT1.2 vector (derived from the Epicentre pCC1FOS™ plasmid). The pFosDT1.2 vector contains two EcoP15I restriction sites that flank the site of insertion. EcoP15I, a type III restriction endonuclease, cuts 25/27 bp downstream of its recognition site and requires two separated and inversely oriented recognition sites in supercoiled DNA. Addition of sinefungin to the EcoP15I digest reaction facilitates cleavage at all recognition sites independent of DNA topology. Starting with 10 μg of purified pooled fosmid DNA from each breast cancer cell line, two independent long-range clonal-insert fosmid diTag massively parallel sequencing libraries were produced. For each fosmid library, 26 bp end-tags from the insert termini were isolated and concatenated as illustrated in Supplemental Figure 2.
Illumina mate pair whole-genome shotgun libraries, of insert sizes ranging from 4 to 6 Kb, were additionally constructed using 10 μg of genomic DNA from each of the MCF7 (ATCC HTB-22) and HCC1954 (ATCC CRL-2338) cell lines. Mate pair libraries were prepared according to the manufacturer’s instructions (Illumina PE-112-1002). Two separate MCF7 mate pair libraries with 4 Kb and 6 Kb inserts were constructed, and a single HCC1954 mate pair library with 5 Kb inserts was constructed.
The fosmid diTag and Illumina mate pair libraries were sequenced on an Illumina Genome Analyzer II massively parallel sequencing system following the manufacturer’s instructions. Raw sequence data for the fosmid diTag and standard Illumina mate pair libraries are available online at www.genboree.org/breastCellLineReads/.
Novocraft-V2.05.02 was used to align quality-filtered paired end reads to the reference human genome (March 2006 assembly, NCBI Build 36.1, UCSC Build HG18). Novoalign parameters used for mapping are described in Supplementary Materials and mapping coordinates are available for viewing and download through the Genboree open-hosting genome browser at www.genboree.org.
Fosmid diTags and Illumina mate pair sequences that align discordantly were used to call putative structural rearrangements. The combined fosmid diTag and standard Illumina mate pair structural rearrangements that validated as cancer-specific somatic mutations are available in the Supplemental Table.
Determining structural variants from Illumina mate pair and fosmid diTag sequences is complicated by two factors: the contamination of inward facing reads and the formation of chimeric clones, respectively. Inward facing reads are paired end sequences from a contiguous piece of DNA whose size equals the final sequencing library length, approximately 400 bp. Formation of chimeric clones during the fosmid diTag procedure introduces false information about the distance and orientation between two reads, complicating structural variant calling.
False positive breakpoints called by inward facing reads were removed prior to reporting structural variants by filtering the discordant read clusters. Inward facing read clusters were filtered based on size and their inward facing read orientation. Because inward facing read clusters are limited by the final sequencing library length, they are easily identified and may be removed from further analysis; see Supplementary Materials for filter parameters. However, inward facing read clusters that span true positive rearrangements cannot be filtered, thereby introducing persistent confounding contamination. To overcome fosmid diTag chimera noise, discordant reads supporting the same structural variant were clustered. Clusters are formed if there are at least two uniquely mapping paired end signatures with corroborating genomic positions, sizes and read orientations. This strategy is called standard clustering and is commonly employed [6, 8, 10, 23–25].
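The standard clustering rule can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Pair` record, the `max_gap` tolerance of 5,000 bp, and the greedy grouping are assumptions chosen for illustration. Only the requirement of at least two corroborating pairs comes from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Pair(NamedTuple):
    """One uniquely mapped discordant read pair (hypothetical record layout)."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def cluster_discordant(pairs, max_gap=5000, min_support=2):
    """Group pairs that agree in chromosomes, strands and approximate positions.

    max_gap is an assumed tolerance; min_support=2 follows the text.
    """
    # Pairs can only corroborate each other if chromosomes and strands match.
    groups = defaultdict(list)
    for p in pairs:
        groups[(p.chrom1, p.strand1, p.chrom2, p.strand2)].append(p)

    clusters = []
    for members in groups.values():
        members.sort(key=lambda p: (p.pos1, p.pos2))
        current = [members[0]]
        for p in members[1:]:
            last = current[-1]
            near = (p.pos1 - last.pos1 <= max_gap
                    and abs(p.pos2 - last.pos2) <= max_gap)
            if near:
                current.append(p)
            else:
                if len(current) >= min_support:
                    clusters.append(current)
                current = [p]
        if len(current) >= min_support:
            clusters.append(current)
    return clusters


pairs = [
    Pair("chr1", 1000, "+", "chr3", 50000, "-"),
    Pair("chr1", 1200, "+", "chr3", 50300, "-"),
    Pair("chr1", 900000, "+", "chr3", 70000, "-"),  # isolated, likely chimera
    Pair("chr2", 100, "+", "chr2", 90000, "+"),     # isolated
]
clusters = cluster_discordant(pairs)
```

In this toy input only the first two pairs support each other, so a single cluster is reported and the two isolated pairs, which behave like chimeric clones, are discarded.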
PCR primers were designed for amplification across aberrant fusions using the human reference genome (March 2006 assembly, NCBI Build 36.1, UCSC Build HG18). The Primer3 primer design algorithm was employed to obtain a set of nested primers using two categories of parameters, stringent and relaxed. Primer pairs in each category are scored and the highest-scoring primer pair is selected for PCR assay validation. Priority is given to the stringent category using the repeat-masked human reference genome. In cases of PCR amplification failure, additional lower-scoring primer pairs were utilized. More details, including Primer3 parameters, can be found in the Supplementary Materials; the automated primer design pipeline code is available for download at http://github.com/oliverhampton/Breakpoint-Primer-Design.
Uniquely mapping reads were used as input for the readDepth R package, which calls copy number alterations by evaluating depth of sequence coverage. The package’s default parameters were used, including an overdispersion value of 3 and a false-discovery rate of 0.01. The readDepth package also provided a breakpoint refinement tool that allowed us to adjust copy number segment ends to match breakpoint positions.
Combined fosmid-sized and 5 Kb breakpoints that map within 2 Mb in MCF7 and 5 Mb in HCC1954 were clustered. Chromosome segment annotations were retained if five or more breakpoints in MCF7 or two or more breakpoints in HCC1954 were contained within the cluster. For each set of MCF7 and HCC1954 breakpoint clusters, the cluster containing the highest number of breakpoints served as a seed for a connected graph (or clique) where the chromosome segments are nodes and spanning breakpoints are edges. In this manner, cliques of four breakpoint clusters in MCF7 on chromosomes 1, 3, 17, and 20, and five breakpoint clusters in HCC1954 on chromosomes 5, 8, and 11 were constructed as shown in Figure 2.
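The clustering and graph construction can be sketched as follows. This is an illustrative reading of the procedure rather than the code used in the study: "within 2 Mb" is interpreted here as a gap between neighbouring breakpoints on the same chromosome, and the function names and data layout are hypothetical. In the study, `max_gap` and `min_count` would be 2 Mb and 5 for MCF7 and 5 Mb and 2 for HCC1954.

```python
from collections import defaultdict


def cluster_breakpoints(breakpoints, max_gap, min_count):
    """Chain neighbouring breakpoints; return (chrom, start, end, count) tuples.

    breakpoints is a list of (chrom, pos). A chain is kept only if it holds
    at least min_count breakpoints.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                count += 1
            else:
                if count >= min_count:
                    clusters.append((chrom, start, prev, count))
                start, count = pos, 1
            prev = pos
        if count >= min_count:
            clusters.append((chrom, start, prev, count))
    return clusters


def connected_component(clusters, rearrangements):
    """Follow rearrangements outward from the largest cluster.

    Clusters are nodes; a rearrangement whose two ends fall in different
    clusters is an edge. Returns sorted indices of the reachable clusters.
    Assumes clusters is non-empty.
    """
    def find(chrom, pos):
        for i, (c, s, e, _) in enumerate(clusters):
            if c == chrom and s <= pos <= e:
                return i
        return None

    adjacency = defaultdict(set)
    for (c1, p1), (c2, p2) in rearrangements:
        a, b = find(c1, p1), find(c2, p2)
        if a is not None and b is not None and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seed = max(range(len(clusters)), key=lambda i: clusters[i][3])
    seen, stack = {seed}, [seed]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)


# Toy rearrangements, each given as a pair of (chrom, pos) ends.
rearrangements = [
    (("chr1", 100), ("chr17", 500)),
    (("chr1", 200), ("chr17", 700)),
    (("chr1", 300), ("chr20", 100)),
    (("chr20", 150), ("chr17", 600)),
    (("chr5", 10), ("chr9", 10)),
]
breakpoints = [end for r in rearrangements for end in r]
clusters = cluster_breakpoints(breakpoints, max_gap=1000, min_count=2)
component = connected_component(clusters, rearrangements)
```

With these toy coordinates, three clusters (on chr1, chr17 and chr20) are linked into one component, while the chr5 and chr9 breakpoints remain dispersed.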
All fosmid diTag and Illumina mate pair breakpoints from the MCF7 and HCC1954 breast cancer cell lines were examined for the presence of low copy repeats (LCRs). Intra- and inter-chromosomal homologous LCRs were detected by applying a novel algorithmic method to the human reference genome sequence (March 2006 assembly, NCBI Build 36.1, UCSC Build HG18). The method achieved higher sensitivity than previously applied methods by using k-mer frequency sequence information to detect, parse and cluster LCRs, without removing high copy number repetitive elements (repeat-masking). The LCRs detected by this method covered 6% of the whole genome in length, of which 19% were gene-containing regions. A detailed description of the algorithm is available in the Supplementary Materials.
Breast cancer cell line breakpoints were confirmed by PCR amplification. For MCF7 (ATCC HTB-22), the negative control was a pool of genomic DNA isolated from a human female (Novagen 70605-3) and from two different cell lines, MCF10A (ATCC CRL-10317) and HCC1599-BL (ATCC CRL-2332). For HCC1954 (ATCC CRL-2338), the negative control was genomic DNA from the HCC1954-BL (ATCC CRL-2339) cell line. Genomic cell line DNA was isolated with the DNeasy kit (Qiagen). PCR bands were visualized on a 2% agarose gel.
The Illumina standard mate pair libraries, with an average 5 Kb insert size, generated 2.9 and 1.9 Gigabases of sequence data for MCF7 and HCC1954, respectively. Upon mapping to the reference genome, the relatively short distance between the paired ends is compatible with PCR primer design across aberrant fusions, and the density of mapped reads allows measurement of segment copy number. The fosmid diTag libraries generated 93.3 and 56.9 Megabases of sequence data for MCF7 and HCC1954, respectively. Because of the larger insert size, mapping of fosmid diTags provided near identical percent-coverage of the reference genome (81% ± 7%) as observed from the Illumina mate pair libraries.
Rearrangements are reported where at least two independent pairs of ends show discrepancy by their predicted size and/or orientation. Discordant mate pairs and diTags are reported when the distance between the mapped ends is in excess of two standard deviations from the insert mean. Discordant ends are clustered based on mapping position and orientation discrepancies, thereby refining the position of the detected breakpoint. From the Illumina 5 Kb mate pair libraries, we identify 23,555 putative rearrangements in MCF7 and 3,824 in HCC1954. Breakpoint-spanning PCR primer designs could be created for 23% of these rearrangements in MCF7 and 61% in HCC1954. From the fosmid diTag libraries, we identify 713 putative rearrangements in MCF7 and 345 in HCC1954. Because of the much longer fosmid diTag insert size and relatively low fold-coverage, these calls were incompatible with standard and long-range PCR primer design.
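As a minimal illustration of the discordance criterion, the sketch below flags a pair when its ends map to different chromosomes, in an unexpected orientation, or at a distance more than two standard deviations from the library mean. The function name, the symmetric treatment of too-short and too-long spans, and the example mean and standard deviation are assumptions; the actual mapping parameters are given in the Supplementary Materials.

```python
def is_discordant(span, same_chrom, expected_orientation,
                  insert_mean, insert_sd, n_sd=2.0):
    """Return True if a mapped pair disagrees with the reference layout.

    span: distance in bp between the mapped ends (ignored if the ends map
    to different chromosomes).
    n_sd: allowed deviation in standard deviations; 2 follows the text.
    """
    if not same_chrom or not expected_orientation:
        return True
    return abs(span - insert_mean) > n_sd * insert_sd


# Hypothetical 5 Kb mate pair library with a 500 bp standard deviation.
concordant = is_discordant(5200, True, True, insert_mean=5000, insert_sd=500)
too_long = is_discordant(36000, True, True, insert_mean=5000, insert_sd=500)
translocated = is_discordant(0, False, True, insert_mean=5000, insert_sd=500)
```

A pair flagged this way is only a candidate; as stated above, at least two independent discordant pairs are required before a rearrangement is reported.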
The high percentage of failed PCR primer designs from the MCF7 mate pair data is due to increased prevalence of repetitive sequence elements surrounding aberrant fusions. Closer inspection of the PCR primer design failure sites reveals overlap with repeat-masked sections of the human genome and disproportionate calling of small indels (2–4 Kb) at a rate ten times higher than expected. We speculate that MCF7 has unique defects in its DNA repair pathways, which could explain the imbalance of mutations between the two cell lines. We have previously shown RAD51C to be mutated in MCF7; such a mutation could affect the Holliday junction (HJ) resolution machinery, causing misrecognition of HJs, cruciforms, and other homology-driven secondary structures and leading to double-strand breaks and accumulation of such indels.
Breakpoint spanning primers from the Illumina mate pairs were applied to their respective breast cancer cell line genomes and normal controls. In most cases, the PCR assay failed to produce an amplification product, indicating a high rate of false positive rearrangement detection. In the cases where a breakpoint amplicon was produced, the majority identified normal structural polymorphisms; only a small percentage identified breast cancer-specific somatic mutation, see Figure 1B Venn diagrams. Interestingly, combining fosmid diTag and Illumina mate pair data and selecting rearrangements detected by both methods shows 3-fold enrichment for cancer-specific somatic mutation and 2-fold reduction in false positives when compared to the Illumina mate pair libraries alone. Combining fosmid-sized and 5 Kb mate pairs provides cross validation of rearrangement detection; moreover, the incorporation of longer fosmid-sized inserts increases specificity for breast cancer-specific somatic mutation and decreases the reporting of false positives when compared to the shorter 5 Kb inserts alone. Combining fosmid diTag and 5 Kb mate pair libraries, we identify 309 chromosomal rearrangements in MCF7 and 72 in HCC1954, and design breakpoint spanning PCR primers for approximately 90% of them; see Figure 1A for the positions of the combined rearrangements. While it is desirable to increase the specificity of chromosomal rearrangement detection, it must be noted that a corresponding loss of sensitivity is associated with this improvement.
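Selecting rearrangements supported by both library types amounts to intersecting two call sets with a positional tolerance. The following Python sketch shows one way this could be done; the tuple layout, the helper names, and the 40 Kb `tolerance` (chosen to be on the order of a fosmid insert) are assumptions and do not reproduce the matching rule used in this study.

```python
def supported_by_both(mate_pair_calls, fosmid_calls, tolerance=40_000):
    """Keep mate pair calls that a fosmid diTag call corroborates.

    Each call is (chrom_a, pos_a, chrom_b, pos_b). Two calls match when
    both ends lie on the same chromosomes within `tolerance` bp.
    """
    def normalise(call):
        # Order the two ends so that A-B and B-A calls compare equal.
        a, b = (call[0], call[1]), (call[2], call[3])
        return (a, b) if a <= b else (b, a)

    def matches(x, y):
        (xa, xb), (ya, yb) = normalise(x), normalise(y)
        return (xa[0] == ya[0] and xb[0] == yb[0]
                and abs(xa[1] - ya[1]) <= tolerance
                and abs(xb[1] - yb[1]) <= tolerance)

    return [m for m in mate_pair_calls
            if any(matches(m, f) for f in fosmid_calls)]


mate_pair_calls = [
    ("chr1", 1000, "chr3", 50000),
    ("chr2", 100, "chr2", 900000),
]
fosmid_calls = [("chr3", 62000, "chr1", 15000)]  # same event, ends swapped
shared = supported_by_both(mate_pair_calls, fosmid_calls)
```

A wide tolerance is needed because a fosmid-sized pair localizes a breakpoint far less precisely than a 5 Kb pair; the intersection raises specificity at the cost of sensitivity, consistent with the trade-off noted above.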
Chimeric gene transcripts have been previously identified in MCF7 [31, 32] and HCC1954 using transcript mapping. Transcript mapping is analogous to targeted paired end sequencing; however, instead of investigating aberrant genomic fusions, chimeric mRNA transcripts are queried. Transcript mapping delivers a gene-centric view of rearrangements that encompasses post-transcriptional modifications, but cannot detect genomic rearrangements outside of gene coding regions. We therefore sought to comprehensively identify rearrangement events at the genomic DNA level that may have caused chimeric or truncated mRNA transcripts.
In MCF7, we identified ten out of nineteen and nine out of thirty genomic rearrangements that are correlated with corresponding chimeric mRNA transcripts reported by Maher et al. [9, 31] and Inaki et al., respectively. These genomic lesions involve oncogenes (TMEM49), tumor suppressors (SULF2, PTPRG), constituents of DNA double-strand break repair (RAD51C.2, BRIP1), and other genes related to cell cycle, growth, and survival (RPS6KB1, ELOVL7, ABCA5); see Table I for functional details.
In HCC1954, we identified three out of seven genomic rearrangements resulting in chimeric or truncated gene transcripts reported by Zhao et al. These three gene truncations (EIF3E, NSD1, PVT1) are implicated in differing aspects of breast and ovarian cancers, and acute myeloid leukemia pathophysiologies; see Table I for functional details. In addition, we discovered a novel genomic rearrangement of UIMC1 (RAP80), a DNA double-stranded break repair accessory protein and suspected tumor suppressor, resulting in the loss of its last 5 exons (exons 11–15), which would eliminate its DNA recognition and binding abilities. The fused DNA (8q24.21) downstream of the UIMC1 breakpoint does not contain any exons or introns, and it remains unclear whether the truncated mRNA would be stable as there is no transcription stop site or polyA tail.
The luminal-type MCF7 and ERBB2-overexpressing HCC1954 breast cancer cell lines are both highly amplified and display complex structural mutability phenotypes, exhibiting distinct profiles of genome structural rearrangement and copy number variation. We integrated read density and breakpoint information from mapped fosmid-sized and 5 Kb mate pair libraries to accurately identify copy number variation (CNV) using the readDepth R package; see Figures 1A and 2 for visualized copy number counts.
For comparison, we obtained data for both breast cancer cell lines run on Affymetrix 100K SNP chips segmented with the GLAD algorithm. Even with approximately 1-fold sequence coverage, our results provide higher resolution than the Affymetrix arrays and allow for CNV calls to be made in many regions where no array probes exist. A look at gross features shows good concordance between the two platforms, including detection of previously described high-level amplifications on MCF7 and on HCC1954; see highlighted regions in Figure 2B. Notably, our sequence-based approach provides higher dynamic range and reveals multiple regions in both cell lines that have been copied 50 to 100 times. Due to saturation effects and lower resolution, these regions are called with far lower copy number on the Affymetrix arrays. There are a small number of aberrations, including regions on chromosomes 2 and 9 in the MCF7 genome, which we believe to be biological differences between different passages and/or sublines of MCF7. Most other discordant events are likely attributable to increased coverage, resolution, and dynamic range from the sequence-based assays.
In MCF7, a 20 Kb segment on cytoband 20q13.31 shows the highest level of amplification with a copy number count of 70. This region encompasses the BMP7 gene, a member of the transforming growth factor-beta superfamily, and corresponds to the fusion of the BMP7 promoter upstream of the ZNF217 oncogene, which is overexpressed in breast cancer. ZNF217 can attenuate apoptotic signals resulting from telomere dysfunction and may promote neoplastic transformation during later stages of malignancy.
In HCC1954, a 51 Kb segment on cytoband 17q12 showed the highest level of amplification with a copy number count of 117. This region encompasses HER2/neu (also known as ERBB-2) which is known to be overexpressed in this cell line. HER2 overexpression in breast cancer is associated with an aggressive tumor phenotype, increased disease recurrence, and overall worse prognosis. HER2 overexpression serves not only as a prognostic marker, but also as a drug target for the monoclonal antibody trastuzumab. Also of interest is a 59 Kb segment on cytoband 11q13.2 showing 9-fold amplification and encompassing the gene CCND1. The CCND1 gene, a key cell-cycle regulator, is often overexpressed in breast cancer patients, and correlates with shorter relapse-free survival times.
In the MCF7 and HCC1954 breast cancer cell lines, we identified rearrangements in genes that code for members of protein complexes involved in DNA double-stranded break repair (DSBR), raising the possibility that distinct defects in DSBR genes may have contributed to different patterns of genomic instability. For example, in MCF7 we identified the gene-gene fusion of RAD51C exons 1–7 to the neuronal-specific gene ATXN7 exons 6–13, resulting in an expressed chimeric transcript. RAD51C is a paralog of RAD51, a gene central to DNA DSBR. RAD51C is an essential component of a complex reported to be involved in resolving Holliday junctions (HJs) formed during DSBR and, as such, is integral to the maintenance of genomic stability. The translocation we have identified eliminates the domain of RAD51C that binds other family members such as RAD51D and XRCC3, possibly disrupting formation of the complex responsible for resolving HJs.
Also in MCF7, we identified truncation of the BRIP1 gene, BRCA1-interacting protein-1. BRIP1 was originally identified as a helicase-like protein that interacts directly with BRCA1 and contributes to its DNA repair function. BRIP1 binds to the BRCT repeat in BRCA1. The C-terminus of BRIP1 is critical for its interaction with BRCA1, and a truncation mutant has been shown to block DSBR [40–42]. Clinically, germline truncation mutations of BRIP1 have been identified in familial breast cancer without mutations of BRCA1/2, and BRIP1 truncations confer a two-fold increased risk of developing breast cancer. We identified a translocation that results in the loss of the last three exons (exons 18–20); however, the fused DNA (3p14) downstream of BRIP1 does not contain any exons or introns. The truncation at exon 17 of BRIP1 would eliminate the C-terminal third of BRIP1 and eliminate binding to BRCA1. However, it is unclear at present whether the truncated mRNA would be stable as there is no transcription stop site or polyA tail.
In HCC1954 we discovered a novel gene truncation of UIMC1 (also referred to as the BRCA1-A complex subunit RAP80). RAP80 has been extensively studied because of its roles in localizing BRCA1 to DNA double strand break sites, regulating BRCA1-dependent DNA damage checkpoint function, and as a potential tumor suppressor [43, 44]. Whereas many RAP80 missense SNP mutations have been identified in non-BRCA1/2 multi-ethnic breast cancer cases [45, 46], no truncating mutation of the RAP80 gene in breast cancer has been previously published. Interestingly, Dr. Xiaochun Yu has identified a truncating SNP mutation on RAP80 cDNA (G1107A) in the ovarian adenocarcinoma cell line TOV21G that results in a premature stop codon at Trp369. This truncation product disrupts the RAP80 interaction with BRCA1 and fails to localize to nuclear foci following DNA damage. The UIMC1 truncation we identified cleaves the native transcript after exon 10 and results in loss of the C-terminus exons 11–15, similarly eliminating DNA recognition and binding capability.
While our fosmid diTags and Illumina mate pairs do not detect the previously published HCC1954 gene truncation of MRE11A, we do confirm the existence of the t(4;11)(q32;q21) genomic lesion involving MRE11A in another study using 2 Kb Life Technologies SOLiD mate pairs (unpublished). MRE11A is involved in homologous recombination, telomere length maintenance and DNA DSBR, and this truncation eliminates its DNA binding domain in the HCC1954 breast cancer cell line.
As evidenced from Figure 1A, the breakpoints in MCF7 and HCC1954 are not evenly distributed across the genome. A number of clusters of closely spaced breakpoints are evident. To formally delineate clustered breakpoints from the remainder, breakpoints within 2 Mb in MCF7 and 5 Mb in HCC1954 were clustered. In each cell line, the cluster containing the highest number of breakpoints was selected to seed a connected graph where chromosome segments are nodes and spanning breakpoints are edges. In MCF7, four clusters emerge at cytobands 1p13.1-p21.1, 3p14.1-p14.2, 17q22-q24.3 and 20q12-q13.33. In HCC1954, five clusters emerge at cytobands 5p15.3, 5q22.3-q23.2, 5q35.2-q35.3, 8q22.2-q24.22, and 11q13.2-q12.3. Moreover, the four MCF7 and five HCC1954 clustered breakpoint locations exactly coincide with high-level amplifications in their respective genomes, indicating possible positive selection and functional significance, see Figure 2A and 2B.
The amplification patterns found in MCF7 and HCC1954 are consistent with the complex firestorm pattern described by Hicks et al., which associates with breast cancer prognostic markers and corresponds with poor patient outcomes. Interestingly, the most often detected recurrent locations of firestorm amplification identified by Hicks et al. within the 243 breast tumors studied reside on chromosomal arms 11q and 17q. These loci contain the genes CCND1 on 11q and ERBB2 on 17q, noted previously to be highly amplified in HCC1954, and may drive selection for these amplifying mutations.
In both cell lines, the remaining non-clustered or dispersed breakpoints are highly associated with low copy repeats (LCRs). The dispersed breakpoints in MCF7 show a 9.8-fold enrichment for LCRs, while the trend is reiterated in HCC1954 with a 9.1-fold enrichment for LCRs. LCR enrichment at dispersed breakpoints is a characteristic previously described in MCF7, and is recurrently identified in HCC1954. This finding is in contrast to the clustered breakpoints, which do not exhibit enrichment for LCRs.
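One simple way to express such a fold enrichment is the fraction of breakpoints falling in LCRs divided by the fraction of the genome that LCRs cover (6% by the method described above). The sketch below uses that definition as an assumption; the exact null model behind the reported 9.8-fold and 9.1-fold values is not stated in this section, and the function name and toy coordinates are hypothetical.

```python
def lcr_fold_enrichment(breakpoints, lcr_intervals, genome_fraction=0.06):
    """Observed fraction of breakpoints in LCRs over the expected fraction.

    breakpoints: list of (chrom, pos).
    lcr_intervals: list of (chrom, start, end), inclusive coordinates.
    genome_fraction: share of the genome covered by LCRs (6% in the text).
    """
    if not breakpoints:
        raise ValueError("at least one breakpoint is required")
    hits = sum(
        1 for chrom, pos in breakpoints
        if any(c == chrom and s <= pos <= e for c, s, e in lcr_intervals)
    )
    observed = hits / len(breakpoints)
    return observed / genome_fraction


# Toy data: 3 of 10 breakpoints fall inside an annotated LCR.
lcrs = [("chr1", 1000, 2000), ("chr7", 500, 900)]
bps = [("chr1", 1500), ("chr1", 1999), ("chr7", 600)] + [
    ("chr2", 100 * i) for i in range(1, 8)
]
fold = lcr_fold_enrichment(bps, lcrs)
```

Under this definition the toy data give roughly a 5-fold enrichment (30% observed against 6% expected); a linear scan over intervals is adequate for a few hundred breakpoints but would be replaced by an interval index for larger sets.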
It is known that chromosomal rearrangements are highly associated with repetitive sequences in genomic disorders and cancer. Up to a quarter of entries in the Gross Rearrangement Breakpoint Database show presence of repetitive elements. The repetitive elements range in size and may be as large as 6 Kb in the case of Long Interspersed Nuclear Elements (LINEs) and may cluster, creating long stretches of non-unique sequence. Breakpoints that overlap repetitive sequence elements may not be detected by 5 Kb (or shorter-range) mate pair libraries. Even if the breakpoint is detected, the non-unique sequence surrounding the rearrangement may make validation by PCR challenging. Having large clonal-sized inserts, fosmid diTags overcome this problem by spanning repetitive sequences and correctly identifying aberrant fusions. For example, in our previous study of MCF7 cells, we identified the expressed DEPDC1B-ELOVL2 chimeric mRNA transcript, which is formed by a 5q12.1 intra-chromosomal inversion. This breakpoint is detected using fosmid diTags, but not 5 Kilobase-sized mate pairs, due to the presence of LINEs, SINEs, and microsatellites surrounding the site of rearrangement.
In many cases, optimal PCR primer design is hindered by the presence of repetitive sequence surrounding the join. This is common when rearrangements are facilitated by homologous recombination [50, 51]. Short repetitive elements or longer segmental duplications (also referred to as low copy repeats) at sites of rearrangement severely limit the number of unique priming positions. Fosmid-sized inserts are able to span such repetitive regions, thus providing means of validating breakpoints even in cases of PCR assay failure. For example, there are two previously published gene truncations identified by our fosmid diTag and 5 Kb mate pair libraries that fail breakpoint spanning PCR assay confirmation. First is the t(5;8)(q35.3;q24.21) translocation in HCC1954 involving the truncation of NSD1, a fusion protein also found in myeloid leukemia. Second is the t(3;15)(p14.1;q23.2) translocation in MCF7 involving the truncation of BRIP1, a BRCA1-interacting protein that contributes to DNA repair. Although these two gene truncations are cross validated by fosmid and 5 Kb sized inserts, PCR assay across the breakpoint results in amplification failure. In these cases, breakpoint spanning primer design was severely hindered due to the presence of interspersed nuclear elements and long terminal repeats across the aberrant joins.
We show that fosmid-sized inserts are adept at spanning repetitive sequences known to exist at sites of gross rearrangement and low copy repeats associated with homologous recombination. Combining fosmid diTag and 5 Kb Illumina mate pair libraries, we were able to detect and validate aberrant fusions involving repetitive genomic sequence where detection by shorter end sequence profiles alone or validation by breakpoint spanning PCR assays failed. In addition, we observe that those rearrangements detected by both insert size ranges exhibit 3-fold enrichment for cancer-specific somatic mutation and 2-fold reduction in false positives when compared to the 5 Kb mate pairs alone.
For those breast cancer-specific somatic mutations involving genes, we queried transcriptome fusion and truncation literature to corroborate our findings and assess the extent to which our combined fosmid diTag and 5 Kb mate pair libraries rediscovered known chimeric transcripts reported in MCF7 and HCC1954. We identified genomic alterations corresponding to approximately half of the published MCF7 and HCC1954 chimeric mRNA transcripts, but it is difficult to assess the lower bound of our sensitivity since it is unclear if the undetected transcript mutations are due to trans-splicing or similar post-transcriptional modifications.
We integrated read density and breakpoint information from mapped fosmid diTags and 5 Kb mate pairs to accurately identify distinct copy number variation in MCF7 and HCC1954. We discovered distinct driver oncogenes associated with high-copy number amplifications in MCF7 and HCC1954. The distinct structural mutability profiles between MCF7 and HCC1954 correlate to their phenotypic differences. Amplified chromosomal segments, breakpoint clusters, and affected genes are located at different positions across the MCF7 and HCC1954 genomes; and correspond to overexpression of different oncogenes, silencing of diverse tumor suppressors, and distinct defects in DNA repair machinery responsible for homology-driven repair of double-stranded DNA breaks. It is intriguing that in conjunction with mutations in the same DNA repair pathway we also find similar patterns of structural mutability in the two cell lines. Both have clustered and dispersed breakpoints; both exhibit clustered breakpoints in regions of high copy number amplification and dispersed breakpoints that are enriched for the presence of low copy repeats.
This project was funded by the NIH-NHGRI grant 1 R01 HG02583 and NIH-NCI grants R33 CA114151 and R21 CA128496 to AM.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.