A powerful way to discover key genes playing causal roles in oncogenesis is to identify genomic regions that undergo frequent alteration in human cancers. Here, we report high-resolution analyses of somatic copy-number alterations (SCNAs) from 3131 cancer specimens, belonging largely to 26 histological types. We identify 158 regions of focal SCNA that are altered at significant frequency across multiple cancer types, of which 122 cannot be explained by the presence of a known cancer target gene located within these regions. Several gene families are enriched among these regions of focal SCNA, including the BCL2 family of apoptosis regulators and the NF-κB pathway. We show that cancer cells harboring amplifications surrounding the MCL1 and BCL2L1 anti-apoptotic genes depend upon expression of these genes for survival. Finally, we demonstrate that a large majority of SCNAs identified in individual cancer types are present in multiple cancer types.
The development of cancer is driven by the acquisition of somatic genetic alterations, including single base substitutions, translocations, infections, and copy number alterations1,2. Recent advances in genome characterization technologies have enabled increasingly systematic efforts to characterize these alterations in human cancer samples3. Identification of these genome alterations can provide important insights into the cellular defects that cause cancer and suggest potential therapeutic strategies2.
Somatic copy-number alterations (SCNAs, distinguished from germline copy-number variations, CNVs; see Supplementary Note 1a) are extremely common in cancer4,5,6. Genomic analyses of tumor samples, by cytogenetic studies and more recently by array-based profiling, have identified recurrent alterations associated with particular cancer types4,5,6. In some cases, focal SCNAs have led to the identification of cancer-causing genes and suggested specific therapeutic approaches7,8,9,10,11,12,13,14.
A critical challenge in the genome-wide analysis of SCNAs is distinguishing the alterations that drive cancer growth from the numerous, apparently random alterations that accumulate during tumorigenesis (see Supplementary Note 1b). By studying a sufficiently large collection of tumors, it should ultimately be possible to create a comprehensive, high-resolution catalog of all SCNAs consistently associated with the development of all major types of cancer. Key open questions include: the extent to which significant SCNAs are associated with known cancer-related genes or indicate the presence of new cancer-related genes in particular tumor types; the extent to which large sample collections can be used to pinpoint the precise ‘targets’ of recurrent amplifications or deletions and thereby to identify cancer-related genes (see Supplementary Note 2); and the extent to which SCNAs are restricted to particular types or shared across many cancer types, suggesting common biological pathways.
In this paper, we explore these issues by studying copy-number profiles from 3,131 cancers across more than two dozen cancer types, with the data all derived from a single experimental platform and analyzed with a common, rigorous statistical methodology.
The 3,131 cancer copy-number profiles consisted of 2,509 profiles determined by our laboratory (see references in Supplementary Note 3), including over eight hundred previously unpublished profiles, and 622 profiles determined by other groups11,15,16. The majority (2965) come from 26 cancer types, each represented by more than 20 specimens. Seventeen cancer types are represented by at least 40 specimens each (Supplementary Table 1). Most profiles (2,520) were obtained from tissue specimens, with the remainder from cancer cell lines (541) and melanoma short-term cultures (70).
Copy-number measurements were obtained on a single array platform, the Affymetrix 250K Sty array, containing probes for 238,270 single nucleotide polymorphisms (SNPs). We compared the signal intensities from each cancer specimen to array data from 1480 normal tissue specimens (of which 1140 were paired with cancer specimens from the same individual) to identify regions of somatically generated SCNA. We recorded the genomic position, length, and amplitude of change in normalized copy-number for every SCNA (Supplementary Figure 1a and Supplementary Methods).
We observed a total of 75,700 gains and 55,101 losses across the 3,131 cancers, for a mean of 24 gains (median = 12) and 18 losses (median = 12) per sample. For most (17/26) cancer types, the mean number of SCNAs per sample was within two-fold of these overall means (Supplementary Figure 1b). Across all samples, 8.3% of amplification and 8.7% of deletion breakpoints (excluding those occurring within centromeres or telomeres) occurred in regions of segmental duplication, which represents an enrichment relative to the proportion of the genome in such regions (5.1% of SNPs; p < 10^-20 in each case) and likely reflects a predisposition to SCNA formation17. An average of 17% of the genome was amplified and 16% deleted in a typical cancer sample, compared to averages of 0.35% and less than 0.1% in normal samples (representing germline CNVs and occasional analytic artifacts).
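The per-sample means and the degree of breakpoint enrichment quoted above follow directly from the reported totals. The short Python check below reproduces them; the variable names are ours, and the fold-enrichment values are simple ratios of the reported percentages, not the statistical test behind the reported p-values.

```python
# Totals reported in the text.
N_SAMPLES = 3131
TOTAL_GAINS = 75_700
TOTAL_LOSSES = 55_101

# Mean number of gains and losses per sample.
mean_gains = TOTAL_GAINS / N_SAMPLES
mean_losses = TOTAL_LOSSES / N_SAMPLES

# Fraction of breakpoints in segmental duplications, relative to the
# fraction of array SNPs that lie in such regions (5.1%).
SEGDUP_SNP_FRACTION = 0.051
amp_fold = 0.083 / SEGDUP_SNP_FRACTION
del_fold = 0.087 / SEGDUP_SNP_FRACTION

print(f"mean gains per sample:  {mean_gains:.1f}")
print(f"mean losses per sample: {mean_losses:.1f}")
print(f"amplification breakpoint enrichment: {amp_fold:.2f}-fold")
print(f"deletion breakpoint enrichment:      {del_fold:.2f}-fold")
```

The means round to the 24 gains and 18 losses given in the text, and both breakpoint classes are enriched roughly 1.6- to 1.7-fold in segmental duplications.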
Across the entire genome, the most prevalent SCNAs are either very short (focal) or almost exactly the length of a chromosome arm or whole chromosome (arm-level) (Figure 1a). The focal SCNAs occur at a frequency inversely related to their lengths, with a median length of 1.8 Mb (range 0.5 kb – 85 Mb).
Arm-level SCNAs occur approximately 30 times more frequently than would be expected by the inverse-length distribution associated with focal SCNAs (Figure 1a). This observation is seen across all cancer types (Supplementary Figure 2) and applies to both copy gains and losses (data not shown). As a result, in a typical tumor, 25% of the genome is affected by arm-level SCNAs and 10% by focal SCNAs, with 2% overlap. All arm-level (and most focal) SCNAs are of low amplitude (usually single-copy changes), but some focal SCNAs can range to very high amplitude. When analyzing SCNAs for evidence of significant alteration in cancer, we accounted for the difference in background rates between arm-level and focal SCNAs by considering them separately.
Multiple studies have analyzed patterns of arm-level SCNAs across large numbers of cancer specimens4,5,6, and our results are largely in agreement with theirs. We additionally observed that the frequency of arm-level SCNAs decreases with the length of chromosome arms. Adjusted for this trend, the majority of chromosome arms exhibit strong evidence of preferential gain or loss, but rarely both, across multiple cancer lineages (see Figure 1b and Supplementary Note 4).
The large size of arm-level SCNAs makes it difficult to determine the specific target gene or genes. By contrast, mapping of focal SCNAs has great power to pinpoint the important genes targeted by these events7,8,9,10,11,12,13,14.
We determined those regions in which SCNAs occur at a significantly high frequency. For this purpose, we calculated the genome-average “background” rates for SCNAs in our dataset as a function of length and amplitude, and used the Genomic Identification of Significant Targets In Cancer (GISTIC) algorithm18 with improvements as described in Supplementary Methods.
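To give a sense of the general idea, the Python sketch below scores each marker by the summed amplitude of gains across samples and compares it with a permutation null. It is a toy illustration only and is not the GISTIC algorithm, whose actual scoring and background model are described in reference 18 and the Supplementary Methods. The function names, the simulated data, and the circular-shift null are all assumptions made for this example.

```python
import random


def toy_profiles(n_samples=20, n_markers=50, n_recurrent=15, seed=1):
    """Simulate copy-number changes: one random 'passenger' gain per sample,
    plus a recurrent gain at markers 20-22 in the first n_recurrent samples."""
    rng = random.Random(seed)
    profiles = []
    for s in range(n_samples):
        profile = [0.0] * n_markers
        start = rng.randrange(n_markers - 3)
        profile[start:start + 3] = [0.5] * 3   # passenger gain
        if s < n_recurrent:
            profile[20:23] = [1.0] * 3         # recurrent gain
        profiles.append(profile)
    return profiles


def marker_scores(profiles, threshold=0.1):
    """Per marker, sum the amplitudes of all gains exceeding the threshold."""
    scores = [0.0] * len(profiles[0])
    for profile in profiles:
        for i, change in enumerate(profile):
            if change > threshold:
                scores[i] += change
    return scores


def permutation_pvalues(profiles, n_perm=200, threshold=0.1, seed=0):
    """Compare each observed score with the maximum score obtained after
    independently circular-shifting every sample (an assumed null model)."""
    rng = random.Random(seed)
    observed = marker_scores(profiles, threshold)
    n_markers = len(observed)
    null_max = []
    for _ in range(n_perm):
        shifted = []
        for profile in profiles:
            k = rng.randrange(n_markers)
            shifted.append(profile[k:] + profile[:k])
        null_max.append(max(marker_scores(shifted, threshold)))
    return [(1 + sum(m >= s for m in null_max)) / (n_perm + 1) for s in observed]


profiles = toy_profiles()
scores = marker_scores(profiles)
pvalues = permutation_pvalues(profiles)
peak = scores.index(max(scores))
print(f"top-scoring marker: {peak}, p = {pvalues[peak]:.3f}")
```

Comparing against the per-permutation maximum is one simple way to control for testing many markers at once; the published method handles background rates by length and amplitude in a more refined way.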
We identified 158 independent regions of significant focal SCNAs, including 76 amplifications and 82 deletions, in the pooled analysis of all our data (Figure 1c and Supplementary Table 2). This number was relatively robust to changes in the number of samples (Supplementary Figure 3a) and removal of individual cancer types from the pooled analysis (Supplementary Figure 3b). Indeed, a stratified analysis of 680 samples distributed evenly across the 17 most highly represented cancer types identified 76% of these significant SCNAs, similar to the number expected based upon the reduced power of this smaller sample set (Supplementary Figure 3a).
The most frequent of these significant focal SCNAs (MYC amplifications and CDKN2A/B deletions) involve 14% of samples, while the least frequent are observed in 2.3% of samples for amplifications and 1.5% for deletions. The frequency of significant arm-level SCNAs is higher (15–29% of samples; Supplementary Figure 3c). These frequencies are likely to be underestimates, as some SCNAs are not detected due to contamination of tumor samples with DNA from adjacent normal cells, technical error, and the incomplete spatial resolution afforded by the SNP array platform.
For each of the 158 significant focal SCNAs, we determined a confidence interval (“peak region”) that has a 95% likelihood of containing the targeted gene (Supplementary Figure 3d). Our large dataset enables more sensitive and high-resolution detection of peak regions than prior copy-number analyses (see Supplementary Note 5 and Supplementary Table 3). An even larger dataset would be desirable, based on analyses showing that the increase in resolution with sample size has not reached a plateau (Supplementary Figure 3e).
The 76 focal amplification peak regions contain a median of 6.5 genes each (range 0–143, including microRNAs). Sixteen regions contain more than 25 genes each; the remaining 60 regions contain in aggregate 364 potential target genes. We found that 25 of the 76 regions (33%) contain functionally validated oncogenes documented to be activated by amplification (Supplementary Table 2), including 9 of the top 10 regions (MYC, CCND1, ERBB2, CDK4, NKX2-1, MDM2, EGFR, FGFR1, and KRAS; Figure 1c, Supplementary Table 2). The tenth region, on 1q, contains nine genes; we present evidence below that the target gene in this region is the anti-apoptotic BCL2 family member, MCL1.
The 82 focal deletion peaks contain a median of seven genes each (range 1–173). Nineteen regions contain at least 25 genes each; the remaining 63 regions contain in aggregate 474 potential target genes. Nine of the 82 regions (11%) contain functionally validated tumor suppressor genes documented to be inactivated by deletion (Supplementary Table 2). Two additional deletions (involving ETV6 and the span from TMPRSS2 to ERG) are associated with translocation events that create oncogenes. Another deletion adjacent to the T-cell receptor beta locus occurs in acute lymphoblastic leukemia and likely is not associated with cancer, as it occurs during normal T-cell development.
The remaining 70 deletion peaks do not contain known tumor suppressor genes, translocation sites, or somatic rearrangements. Over one-third (26) contain large genes, whose genomic loci span more than 750 kb; none of these genes has been convincingly demonstrated to be a tumor suppressor gene. Conversely, 19 of the 40 largest genes in the genome occur in deletion peaks (Figure 2a; p = 3 × 10^-9). This association between deletions and large genes could be due to a propensity for both to occur in regions of low gene density. Indeed, large genes tend to occur in gene-poor regions (Figure 2a, bottom), and an analysis of all SCNAs in the dataset reveals that deletions (but not amplifications) show a bias toward regions of low gene density (up to 30% below the genome average; Figure 2b). Even after removing the 26 SCNAs containing large genes, the gene density among the remaining deletions is still 25% below the genome average. These observations suggest that some of the deletions may not be related to cancer etiology, but rather may reflect a high frequency of deletion or low levels of selection against deletion in these regions.
The majority of known amplified oncogenes reside within the 76 amplified regions, although there are exceptions. For example, MITF19 is likely undetected because it is a lineage-specific oncogene restricted to melanoma. At least 10 known deleted tumor suppressor genes do not reside in the deleted regions in the pooled analysis (BRCA2, FBXW7, NF2, PTCH1, SMARCB1, STK11, SUFU, VHL, WT1, and WTX). Some of these are specific to cancer types not represented in our dataset (e.g. NF2, WT1, and WTX), while others primarily suffer arm-level deletions (with possible additional deletions beyond the resolution of the array platform) (e.g. BRCA2, FBXW7, STK11 and VHL). Other tumor suppressor genes may be missed if they lie within regions whose background deletion rates are lower than the genome-wide average, or if they are adjacent to genes whose deletion is poorly tolerated (which would be expected to occur more readily in regions of high gene density) (see Supplementary Note 1). Such tumor suppressors might be inactivated by point mutations more often than SCNAs.
We assessed potential cancer-causing genes in the SCNAs using GRAIL (Gene Relationships Among Implicated Loci20), an algorithm that searches for functional relationships among genomic regions. GRAIL scores each gene in a collection of genomic regions for its ‘relatedness’ to genes in other regions based on textual similarity between published abstracts for all papers citing the genes, on the notion that some target genes will function in common pathways.
We found that 47 of the 158 peak regions (34 of the 76 amplification peaks and 13 of the 82 deletion peaks) contain genes significantly related to genes in other peak regions (Figure 2c). In 21 of these regions, the highest-scoring gene was a previously validated target of SCNA in human cancer (Supplementary Table 2). Across all peak regions, the literature terms most significantly enriched refer to gene families important in cancer pathogenesis, such as kinases, cell cycle regulators, and MYC family members (Figure 2d, top; Supplementary Table 4).
To discover new genes, we next examined the 122 regions without previously documented SCNA targets. The most significantly enriched literature term associated with the amplification peaks was “apoptosis” (Figure 2d, bottom; Supplementary Table 4). Two of the five known anti-apoptotic members of the BCL2 family21 (MCL1 and BCL2L1) are in amplification peaks. Two of 11 pro-apoptotic members (BOK and BBC3) were also found among deletion peaks, for a total of four of the 16 known BCL2 family members, with anti-apoptotic genes amplified but not deleted and vice versa for pro-apoptotic genes (Figure 3a; p = 3e-10). Although some BCL2 family members are known to be translocation and point mutation targets22,23,24,25,26, pathway dysregulation by copy-number change has not been well-described. Below, we describe functional validation that MCL1 and BCL2L1 are targets of amplifications that encompass them.
The second-ranking term among amplification peaks without known targets was “NF-κB”, reflecting a preponderance of members of this pathway (TRAF6, IKBKB, IKBKG, IRAK1, and RIPK1; p = 0.001 for pathway enrichment27) and consistent with an emerging recognition of its importance in multiple cancer types28,29,30.
Because some gene families may have been missed by GRAIL, we separately analyzed gene ontology (GO) terms for association with amplification peaks (data not shown). We identified significant enrichment of genes associated with “molecular adaptor activity” (GO: 0060090, p=4e-10), including IRS2, GRB2, GRB7, GAB2, GRAP, TRAF2, TRAF6, and CRKL. IRS2 and GAB2 are known to be transforming when overexpressed31,32, and CRKL has been reported as an essential gene among cells in which it is amplified33.
MCL1 is one of nine genes in an amplification peak in cytoband 1q21.2 (Figure 3b and Supplementary Table 2) with focal amplifications observed in 10.9% of cancers across multiple tissue types. Fluorescence in situ hybridization (FISH) of the MCL1 region in lung and breast cancers revealed much higher rates of focal amplification (Supplementary Figure 4a–b). Amplifications of 1q21.2 were previously reported in lung adenocarcinoma and melanoma7,34,35, but the peak regions in those studies contained 86, 36 and 53 genes, respectively.
We examined whether cell growth depends upon MCL1 in the presence of gene amplification by measuring the rate of change in cell number after activating an inducible shRNA against MCL1 in cells with and without 1q21.2 amplification. We observed a more pronounced reduction in proliferation rates among four MCL1-amplified cell lines, compared to three MCL1 non-amplified control cell lines (p = 0.05; Figure 3c) (all achieved >70% knockdown; Supplementary Figure 4c). Reducing the expression of 6 of the other genes (all by >70%; Supplementary Figure 4d) within the 1q21.2 amplicon in NCI-H2110 cells produced no significant effects (Figure 3d). Similar effects were observed following MCL1 depletion with multiple shRNAs and siRNAs (Supplementary Figure 4e). Growth of NCI-H2110 xenografts was also inhibited by induction of anti-MCL1 shRNA (Figure 3e).
BCL2L1 is one of five genes in a peak region of amplification on 20q11.21 (Supplementary Figure 5a). Amplifications of this region have been previously noted in lung cancer36, giant-cell tumor of bone37, and embryonic stem cell lines (the latter also amplifying a region including BCL2)38,39, but functional validation of BCL2L1 as a gene targeted by these amplifications has not been reported. We examined BCL2L1 dependency using shRNA against BCL2L1 in cells with and without 20q11.21 amplification. We observed a more pronounced reduction in proliferation rates among six BCL2L1-amplified lines (including SKLU1, which was MCL1-independent), compared to seven BCL2L1 non-amplified lines (p = 0.006; Figure 3f). These decreased proliferative rates were associated with increased apoptosis (Supplementary Figure 5b).
We then sought to explore how amplification of these BCL2 family members might act in cancer by examining other SCNAs found in cancers carrying MCL1 or BCL2L1 amplifications. The most frequent additional focal SCNA in these cancers was amplification of the region carrying MYC (62% and 69%, respectively). BCL2 has previously been shown to reduce MYC-induced apoptosis in lymphoid cells40. We found that over-expression of MCL1 and BCL2L1 in immortalized bronchial epithelial cells also reduces MYC-induced apoptosis (Supplementary Figure 5c–d). Oncogenic roles for MCL1 and BCL2L1 have been previously suggested by reports of increased rates of lymphoma and leukemia in transgenic mice41,42. Somatic amplification of MCL1 and BCL2L1 may therefore be a common mechanism for cancers, including carcinomas, to increase cell survival.
Our analysis of a large number of cancer types with a high-resolution platform afforded an opportunity to quantify the degree to which significant focal SCNAs are shared across cancer types. We performed separate analyses of each of the 17 cancer types represented by at least 40 samples and compared the significant SCNAs to those from a pooled analysis of the remaining samples, excluding the cancer type in question.
The majority of focal SCNAs identified in any one of these 17 cancer types are also found in the pooled analysis excluding that cancer type (median 79% overlap, versus 10% for randomly permuted regions, p < 0.001; Figure 4) and, indeed, in the 158 regions from the overall pool. Nonetheless, cancer type-restricted analyses identified an additional 199 significant SCNAs (145 regions of amplification, 54 regions of deletion, Supplementary Table 5). (These exclude 79 regions of amplification on chromosome 12 found only in dedifferentiated liposarcomas and likely to be related to the ring chromosomes in that disease43). However, even many of these regions were found to occur in more than one cancer type (median two). As would be expected, the 158 regions in the pooled analysis were found in more cancer types (median five) and were better localized (median size 1.5 Mb vs. 11 Mb in the lineage-restricted analyses).
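The comparison of observed overlap with that of randomly permuted regions can be sketched as follows. This is a minimal Python illustration on a single linear toy genome. The region coordinates, the genome length, and the function names are hypothetical, and the way regions are relocated (uniform random placement preserving length) is an assumption, since the exact permutation scheme is given only in the Supplementary Methods.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(query, reference):
    """Fraction of query regions that overlap at least one reference region."""
    hits = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hits / len(query)


def permuted_overlap(query, reference, genome_length, n_perm=1000, seed=0):
    """Return (observed fraction, mean permuted fraction, empirical p-value)."""
    rng = random.Random(seed)
    observed = fraction_overlapping(query, reference)
    null = []
    for _ in range(n_perm):
        shuffled = []
        for start, end in query:
            length = end - start
            new_start = rng.randrange(genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        null.append(fraction_overlapping(shuffled, reference))
    p_value = (1 + sum(f >= observed for f in null)) / (n_perm + 1)
    return observed, sum(null) / n_perm, p_value


# Hypothetical regions on a toy genome of length 1000 (arbitrary units).
pooled = [(100, 110), (300, 310), (500, 510), (700, 710), (900, 910)]
single_type = [(102, 108), (305, 315), (495, 505), (702, 706), (50, 55)]

observed, mean_null, p_value = permuted_overlap(single_type, pooled, 1000)
print(f"observed overlap {observed:.0%}, permuted mean {mean_null:.0%}, p = {p_value:.3f}")
```

In this toy case four of the five single-type regions overlap a pooled region, far more than the permuted placements achieve on average.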
Arm-level alterations, like focal SCNAs, tend to be shared among multiple cancer types (Supplementary Note 4). Prior studies have demonstrated a tendency for cancers of similar developmental lineages to exhibit similar recurrent arm-level SCNAs44. We found that this tendency was much more apparent for arm-level than focal SCNAs (see Supplementary Note 6), suggesting that arm-level SCNAs are shaped to a greater extent by developmental context.
The raw data and analyses from this study are available at www.broadinstitute.org/tumorscape, including segmented copy-number data (viewable using the Integrative Genomics Viewer [Robinson et al, in preparation]) and profiles describing the significance of copy-number changes. The portal also supports gene copy-number queries across and within individual cancer types (instructions are in Supplementary Note 7).
This study represents the largest analysis to date of high-resolution copy-number profiles of cancer specimens. Several features of the copy-number landscape apply to the vast majority of cancer types. There is a strikingly high prevalence of arm-level SCNAs4,5,6, which likely reflects the ease with which such mutational events occur compared to focal events45,46. The analysis also reveals a strong tendency for significant focal SCNAs in one cancer type to be also found in multiple others.
We identified a total of 357 significant regions of focal SCNA, including 158 regions in the pooled analysis and 199 regions in analyses of individual cancer types. These are surely underestimates of the number of regions that are significantly altered in cancer. Many cancer types were represented by relatively few samples; others were not represented at all. Some SCNAs were missed due to the resolution limit of the array platform. Further efforts will be needed to characterize larger numbers of cancer genomes at higher resolution to create a comprehensive catalog of the significant SCNAs and define their occurrence in different cancer types.
A key challenge is to identify the cancer gene targets of each of these SCNAs. Fewer than one-quarter of the 158 common peak regions are associated with previously validated targets of SCNAs in human cancer. While a subset of the SCNAs may represent deletion events that are tolerated but not causally involved in cancer (as suggested by the correlation with gene-poor regions) or frequent due to mechanistic bias (e.g. associated with fragile sites)47, many more cancer-causing genes are likely to be found through analysis of SCNAs. The GRAIL analysis of our peak regions points to more than a dozen likely candidates, while the functional analysis of MCL1 and BCL2L1 strongly implicates these genes as amplification targets. Moreover, some SCNAs may contain multiple functional targets10.
Identification of the target genes will require both genomic and functional studies. For focal events, the copy-number profiles of additional samples at higher resolution can help narrow the lists of candidates. Nucleotide sequencing may identify point mutations, especially in the case of heterozygous deletions. Because overlapping SCNAs in different cancer types may target different genes, all candidates should be functionally tested separately in each cancer type in appropriate model systems.
While many canonical oncogenes and tumor suppressor genes are known to be altered across multiple cancer types and functionally relevant in model systems of diverse tissue origins1, it has not been clear whether these genes are typical or represent a discovery bias toward genes relevant to multiple cancer types. By studying a large number of cancers of multiple types, we have found that most of the significant SCNAs within any single cancer type tend to be found in other cancer types as well. Similar findings for point mutations and translocations would suggest that the appearance of tremendous diversity across cancer genomes may reflect combinations of a limited number of functionally relevant events.
DNA extracted from cancer specimens and normal tissue was labeled and hybridized to the Affymetrix 250K Sty I array to obtain signal intensities and genotype calls. Signal intensities were normalized against data from 1480 normal samples. Copy-number profiles were inferred using GLAD48 and changes of > 0.1 copies in either direction were called SCNAs. The significance of focal SCNAs (covering < 0.5 chromosome arms) was determined using GISTIC18, with modifications to score SCNAs directly proportional to amplitude and to allow summation of non-overlapping deletions affecting the same gene. Peak region boundaries were determined so that the change in the GISTIC score from peak to boundary had < 5% likelihood of occurring by random fluctuation. P-values for Figures 2b and 4 were determined by comparing the gene densities of SCNAs and fraction overlap of peak regions respectively to the same quantities calculated from random permutations of the locations of these SCNAs and peak regions. RNAi was performed by inducible and stable expression of shRNA lentiviral vectors and by siRNA transfection. Proliferation in inducible shRNA experiments was measured in triplicate every half-hour on 96-well plates by a real time electric sensing system (ACEA Bioscience) and in stable shRNA expression and siRNA transfection experiments by CellTiterGlo (Promega). Apoptosis was measured by immunoblot against cleaved PARP and FACS analysis of cells stained with antibody to annexin V and propidium iodide. Tumor growth in nude mice was measured by caliper twice weekly. Expression of MYC, MCL1, and BCL2L1 was performed with retroviral vectors in lung epithelial cells immortalized by introduction of SV40 and hTERT49.
Full methods are described in Supplementary Methods.
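The two thresholds stated in the methods summary above (a change of more than 0.1 copies to call an SCNA, and a length below 0.5 chromosome arms to treat it as focal) can be expressed as a small Python sketch. The `Segment` class, the `classify` function, the "broad" label for non-focal events, and the arm length used in the example are all hypothetical and serve only to illustrate the rule; segmentation itself (GLAD) is not reproduced here.

```python
from dataclasses import dataclass

COPY_THRESHOLD = 0.1          # copies, in either direction (from the text)
FOCAL_MAX_ARM_FRACTION = 0.5  # focal SCNAs cover < 0.5 chromosome arms


@dataclass
class Segment:
    arm: str            # chromosome arm label, e.g. "8q"
    start: int          # start position in bp
    end: int            # end position in bp
    copy_change: float  # change in copy number relative to normal


def classify(segment, arm_lengths):
    """Return (direction, scope) for a called SCNA, or None if below threshold."""
    if abs(segment.copy_change) <= COPY_THRESHOLD:
        return None
    direction = "gain" if segment.copy_change > 0 else "loss"
    fraction = (segment.end - segment.start) / arm_lengths[segment.arm]
    scope = "focal" if fraction < FOCAL_MAX_ARM_FRACTION else "broad"
    return direction, scope


# Illustrative arm length only; not a real genome coordinate.
arm_lengths = {"8q": 100_000_000}

examples = [
    Segment("8q", 10_000_000, 12_000_000, 0.8),   # short, high-amplitude gain
    Segment("8q", 0, 100_000_000, -0.4),          # whole-arm loss
    Segment("8q", 40_000_000, 45_000_000, 0.05),  # below the calling threshold
]
for seg in examples:
    print(seg.arm, seg.start, seg.end, classify(seg, arm_lengths))
```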
This work was supported by grants from NIH (Dana-Farber/Harvard Cancer Center and Pacific Northwest Prostate Cancer SPOREs, P50CA90578, R01CA109038, R01CA109467, P01CA085859, P01CA098101, and K08CA122833), the Doris Duke Foundation, the Sarah Thomas Monopoli Lung Cancer Research Fund, the Seaman Corporation Fund for Lung Cancer Research, and the Lucas Foundation. Medulloblastoma samples were obtained in collaboration with the Children’s Oncology Group. Natalie Vena provided technical assistance with FISH, while Ingo Mellinghoff, Paul S. Mischel, Linda Liau, and Tim F. Cloughesy provided DNA samples. We thank Thomas Ried, Robert Weinberg, and Bert Vogelstein for critical review of the manuscript and insightful comments regarding its context in the field of cancer genetics.
Competing Interests Statement
The authors declare that they have no competing financial interests.
Author Contributions
RB, CHM, ESL, GG, WRS, and MM conceived and designed the study; RB, JB, MU, AHL, YC, WW, BAW, DYC, AJB, JP, SS, EM, FJK, HS, JET, JAF, JT, JB, MST, FD, MAR, PAJ, CN, RLL, BLE, SG, AKR, CRA, ML, LAG, ML, DGB, LDT, AO, SLP, SS, and MM contributed primary samples and/or assisted in the generation of the data; RB, CHM, SR, JD, MSL, BAW, MJD, and GG performed the data analysis; RB, DP, GW, JD, JSB, KTM, LH, HG, KET, AL, CH, DY, AL, LAG, TRG, and MM designed and performed the functional experiments on BCL2 family member genes; RB, CHM, RMP, MR, TL, and QG designed and built the cancer copy number portal; RB, CHM, ESL, and MM wrote, and all other authors have critically read and commented on, the manuscript.
SNP array data have been deposited to GEO under accession number GSE19399.