|Home | About | Journals | Submit | Contact Us | Français|
The rapid growth of cancer genome structural information provides an opportunity for a better understanding of the mutational mechanisms of genomic alterations in cancer and the forces of selection that act upon them. Here we test the evidence for two major forces, spatial chromosome structure and purifying (or negative) selection, that shape the landscape of somatic copy-number alterations (SCNAs) in cancer1. Using a maximum likelihood framework we compare SCNA maps and three-dimensional genome architecture as determined by genome-wide chromosome conformation capture (HiC) and described by the proposed fractal-globule (FG) model2,3. This analysis provides evidence that the distribution of chromosomal alterations in cancer is spatially related to three-dimensional genomic architecture and additionally suggests that purifying selection as well as positive selection shapes the landscape of SCNAs during somatic evolution of cancer cells.
Somatic copy-number alterations (SCNAs) are among the most common genomic alterations observed in cancer, and recurrent alterations have been successfully used to implicate cancer-causing genes1. Effectively finding cancer-causing genes using a genome-wide approach relies on our understanding of how new genome alterations are generated during the somatic evolution of cancer4–7. As such, we test the hypothesis that three-dimensional chromatin organization and spatial co-localization influences the set of somatic copy-number alterations observed in cancer (Fig. 1A, recently suggested by cancer genomic data in a study of prostate cancer8. Spatial proximity and chromosomal rearrangements are discussed more generally9–12). Unequivocally establishing a genome-wide connection between SCNAs and three-dimensional chromatin organization in cancer has until now been limited by our ability to characterize three-dimensional chromatin architecture, and the resolution with which we are able to observe SCNAs in cancer. Here, we ask whether the “landscape” of SCNAs across cancers1 can be understood with respect to spatial contacts in a 3D chromatin architecture as determined by the recently developed HiC method for high-throughput chromosome conformation capture2 or described theoretically via the fractal globule (FG) model (theoretical concepts13, review3). Specifically, we investigate the model presented in Figure 1A, and test whether distant genomic loci that are brought spatially close by 3D chromatin architecture during interphase are more likely to undergo structural alterations and become end-points for amplifications or deletions observed in cancer.
Towards this end, we examine the statistical properties of SCNAs in light of spatial chromatin contacts in the context of cancer as an evolutionary process. During the somatic evolution of cancer14,15 as in other evolutionary processes, two forces determine the accumulation of genomic changes (Fig. 1A): generation of new mutations and fixation of these mutations in a population. The rate at which new SCNAs are generated may vary depending upon the genetic, epigenetic, and cellular context. After an SCNA occurs, it proceeds probabilistically towards fixation or loss according to its impact upon cellular fitness. The fixation probability of an SCNA in cancer depends upon the competition between positive selection if the SCNA provides the cancer cell with a fitness advantage, and purifying (i.e. negative) selection if the SCNA has a deleterious effect on the cell. The probability of observing a particular SCNA thus depends upon its rate of occurrence via mutation, and the selective advantage or disadvantage conferred by the alteration (Fig. 1A). Positive, neutral, and purifying selection are all evident in cancer genomes16.
Our statistical analysis of SCNAs argues that both contact probability due to chromosomal organization at interphase and purifying selection contribute to the observed spectrum of SCNAs in cancer. From the full set of reported SCNAs across 3,131 cancer specimens in1, we selected 39,568 intra-arm SCNAs (26,022 amplifications and 13,546 deletions) longer than a megabase for statistical analysis, excluding SCNAs which start or end in centromeres or telomeres. To establish that our results were robust to positive selection acting on cancer-associated genes, we analyzed a collection of 24,301 SCNAs that do not span highly-recurrent SCNA regions (16,521 amplifications and 7,789 deletions, respectively 63% and 58% of the full set1, see Methods). We present results for the less-recurrent SCNAs, and note that our findings are robust to the subset of chosen SCNAs. We performed our analysis by considering various models of chromosomal organization and purifying selection, which were used to calculate the likelihood of the observed SCNA given the model. The likelihood framework was then used to discriminate between competing models. Statistical significance was further evaluated using permutation tests. The strong association we find between SCNAs and high-order chromosomal structure is not only consistent with the current understanding of the mechanisms of SCNA initiation17, but provides insight into how spatial proximity may be arrived at via chromosomal architecture and the significance of chromosomal architecture for patterns of SCNAs observed at a genomic scale.
The initial motivation for our study was an observation that the length of focal SCNAs and the length of chromosomal loops (i.e. intra-chromosomal contacts) have similar distributions (Figs. 1B and 1C), both exhibiting ~ 1/L scaling. Analysis of HiC data for human cells showed that the mean contact probability over all pairs of loci a distance L apart on a chromosome goes as PHiC (L) ~ 1/L for a range of distances L= 0.5 to 7Mb2. This scaling for mean contact probability was shown to be consistent with a fractal globule (FG) model of chromatin architecture. Similarly, the mean probability to observe a focal SCNA of length L goes approximately as PSCNA (L) ~ 1/L for the same range of distances L= 0.5 to 10 Mb as noted in 1. Mathematically, the observation that the mean probability to observe an SCNA decays with its length is quite significant. If two SCNA ends were chosen randomly within a chromosome arm, the mean probability to observe an SCNA of length L would remain constant. Positive selection, which tends to amplify oncogenes or delete tumor suppressors, again does not give rise to a distribution whose mean decreases with length. Either purifying selection or a length-dependent mutational mechanism is required to observe this result.
The connection between three-dimensional genomic architecture and SCNA structure goes beyond the similarity of their length distributions: loci that have higher probability of chromosomal contacts are also more likely to serve as SCNA end points (Fig. 2). To quantitatively determine the relationship between three-dimensional genomic architecture and SCNA, both data sets were converted into the same form. For each chromosome, we represent HiC data as a matrix of counts of spatial contacts between genomic locations i and j as determined in the GM06990 cell line using a fixed bin size of 1 Mb2. Similarly, we construct SCNA matrices by counting the number of amplifications or deletions that start at genomic location i and end at location j of the same chromosomes across the 3,131 tumors. Figure 2 presents HiC and SCNA matrices (heatmaps) for chromosome 17. Away from centromeric and telomeric regions, which are not considered in this analysis, the SCNA heatmap appears similar to the HiC heatmap (Pearson’s r = .55, p < 0.001, see Supplementary Table S1 for other chromosomes). In particular, regions enriched for 3D interactions also appear to experience frequent SCNA. Since the Pearson correlation coefficient is not suited for describing rare probabilistic events like SCNAs, for further analysis we employ the Poisson likelihood, a widely-used method to statistically analyze rare events18.
To further test the role of chromosome organization for the generation of SCNA, we developed a series of statistical models of possible SCNA-generating processes, computed the Poisson likelihoods of the SCNA data given these models (see Eq. 6), and performed model selection using their Bayesian Information Criterion (BIC) values, which is the log-likelihood of a given model penalized by its number of fitting parameters (see Eq. 7). Considered models take into account different mechanisms of the generation of SCNA, with a mutation rate either: uniform in length (Uniform), derived from experimentally determined chromatin contact probabilities (HiC) or derived from contact probability in the fractal globular chromatin architecture (FG). We note that the FG model specifies a contact probability that depends on the distance between genomic loci, but does not include positional differences at a given distance.
We also took into account possible deleterious effects of SCNAs due to purifying selection, which can lead to a reduced probability of fixation (see Eq. 1). Deleterious effects of SCNAs on cellular fitness may arise from the disruption of genes or regulatory regions; as such, we expect longer SCNAs to be more deleterious. A relationship between SCNA length and its deleterious effect on cellular fitness is supported by the observation that whole-arm SCNAs are less likely for longer chromosomal arms1, as well as an observation of linearly decreasing bacterial growth rate with longer amplifications19. If we assume that the deleterious effect of an SCNA increases linearly with its length L, and consider the somatic evolution of cancer as a Moran process15,20, we find that the probability of fixation decays roughly exponentially with length at a rate that reflects the strength of purifying selection (see Eq. 4, Fig. 1B). Combining the effects of purifying selection on fixation probability with the mutational models leads to the following six models: Uniform, Uniform+sel, HiC, HiC+sel, FG, FG+sel, with no fitting parameters for models without selection and a single fitting parameter for selection, where the additional parameter is penalized via BIC.
Model selection provides two major results (Fig. 3): First, among models of SCNA generation, a model that follows the chromosomal contact probability of the fractal globule (~ 1/L) significantly outperforms other models. Second, since considering purifying selection helps fit the observed roll-over in the number of SCNAs at longer distances (L >20Mb, Fig. 1B), every model is significantly improved when purifying selection is taken into account (p < .001 via bootstrapping), suggesting that SCNAs experience purifying selection. We note that the additional decline in the number of SCNA at long distances could possibly be due to alternative chromatin-independent mechanisms that further disfavor the formation of exceptionally long SCNAs. Figure 3 presents log-likelihood ratios of the models (with and without purifying selection) with respect to the uniform model. If models are fit on a chromosome-by-chromosome basis (Supplementary Fig. 2) we observe that for long chromosomes, the FG model fits better than purifying selection alone. We also find that the best-fit parameter describing purifying selection is proportional to chromosome length (Supplementary Fig. 2A). Since smaller values for the best-fit parameter correspond to stronger purifying selection, these two results suggest that short, gene-rich, chromosomes may experience greater purifying selection. However, we note that purifying selection proportional to the genomic length of an SCNA fits the data better than purifying selection proportional to the number of genes affected by an SCNA (Supplementary Fig. 3).
We next tested whether the position-specific structure of chromosomal contacts observed in experimental HiC data, and absent for the FG, was evident in the SCNA landscape. The test was performed using permutation analysis (Fig. 4). Since both the probability of observing an SCNA with a given length and intra-chromosomal contact probability in HiC depend strongly on distance L, we permuted SCNAs in a way that preserves this dependence but destroys the remaining fine structure. This is achieved by randomly reassigning SCNA starting locations within the same chromosomal arm, while keeping their lengths fixed. We find that HiC fits the observed SCNAs much better than it fits permuted SCNAs (Fig. 4, p<.001). Similar analysis within individual chromosomes shows that the fit is better for 18 of the 22 autosomal chromosomes, except for chromosomes 10, 11, 16, and 19, and is significantly better (p<.01) for nine chromosomes 1, 2, 4, 5, 7, 8, 13, 14, and 17 (Fig. 4B). While the observed amplification and deletions each separately fit better on average than their permuted counterparts (Supplementary Fig. 5), deletions fit considerably better than amplifications (p<0.001 vs. p<0.05).
Finally, we examined the possible influence of chromosomal compartments (domains, as determined in2) on the landscape of SCNAs by fitting models where SCNA formation is favored if both ends are in the same type of domain (see Methods). Maximizing the likelihood of this two-parameter FG+domains model demonstrated a marginal increase in the BIC-corrected likelihood above the FG model for deletions, and not for amplifications (Supplemental Fig. 8 and 9). The best-fitting domain strength parameter values favored small (10–20%) increases in the relative probability of intra-domain SCNAs. Additionally, the best-fitting FG+domains model shows a smaller amount of position-specific information than HiC, as determined by permutation tests (Supplemental Fig. 8).
Our genome-wide analysis of HiC measurements and cancer SCNA finds multiple connections between higher-order genome architecture and re-arrangements in cancer. Using an incisive likelihood-based BIC framework, we found that: (1) probability of a 3D contact between two loci based on the FG model explains the length distribution of SCNA better than other mechanistic models or than a model of purifying selection alone; (2) comparisons with permuted data demonstrate the significant connection between megabase-level position-specific 3D chromatin structure observed in HiC and SCNA; (3) a multiplicative model favoring intra-domain SCNAs provides little improvement beyond the FG model and has less position-specific information than HiC; (4) SCNA data reflect mutational mechanisms and purifying selection, in addition to commonly considered positive selection.
These results argue strongly for the importance of 3D chromatin organization in the formation of chromosomal alterations. While the distribution of SCNAs could conceivably depend on a complicated mutation and selection landscape, which is merely correlated with 3D genomic structure, a direct explanation via 3D genomic contacts is more parsimonious. Along these lines, two recent experimental studies of translocations suggest that physical proximity is among the key determinants of genomic rearrangements21,22. Additionally, a genome-wide analysis of translocations across cancers demonstrates an enrichment of translocations among chromosomal loci with greater numbers of experimentally determined chromosomal contacts23.
Genomic architecture may vary with cancer cell type of origin and the specific chromatin states of these cells24,25, thus influencing the set of observed SCNAs in each cancer type; for example, re-arrangement breakpoints in prostate cancer were found to correlate with loci in specific chromatin states of prostate epithelial cells8. In fact, if HiC data matching the tumor cell-types of origin for the set of observed SCNAs becomes available, we may find that the cell-type specific experimental 3D contacts fit the observed distribution of SCNAs better than the fractal globule model. Despite this limitation, when we perform a permutation analysis on SCNAs grouped by cancer lineage (epithelial, hematopoietic, sarcomas and neural), we still find that HiC fits the observed SCNAs significantly better than it fits permuted SCNAs consistently across cancer lineages for deletions, but not for amplifications (Supplementary Fig. 6).
Differences between amplifications and deletions (Supplementary Figs. 4, 5, 6) may reflect differences in the strength of selection and mechanisms of genomic alteration: conceivably a simple loss of a chromosomal loop could lead to a deletion, while amplifications may occur through more complicated processes17 and may require interactions with homologous and non-homologous chromosomes that are not necessarily directly related to intra-chromosomal spatial proximity during interphase.
Our results suggest that a comprehensive understanding of mutational and selective forces acting on the cancer genome, not limited to positive selection of cancer-associated genes, is important for explaining the observed distribution of SCNAs. Furthermore, comparing model goodness-of-fits for the distribution of SCNAs argues that purifying selection is a common phenomenon, and that many SCNAs in cancer may be mildly deleterious “passenger mutations” (reviewed in26,27). We note that while we find evidence for both chromatin organization and purifying selection in the length distribution of SCNAs, in our best-fitting model, 3D chromatin architecture explains a factor of ~100 in relative frequencies of SCNAs, whereas purifying selection contributes an additional factor of ~3 for long SCNAs (L > 20–100Mb) and has little effect on the frequency of shorter SCNAs (L < 20Mb). Presumably, mechanisms other than purifying selection could lead to additional suppression of excessively long SCNAs. However, the observed exponential rollover in the number of SCNAs at long distances is unlikely to be caused by limitations arising from SCNA mapping, since whole-arm SCNAs are successfully detected at high frequencies.
The sensitivity and relevance of comparative genomic approaches to chromosome rearrangements can only increase as additional HiC-type datasets become available. Future studies will be able to address the importance of different 3D structures to the observed chromosomal rearrangements across cell types and cell states. Perhaps even more importantly, cancer genomic sequencing data will allow for significantly more detailed analyses than the current array-based approaches, allowing for greater mechanistic insight into SCNA formation. In particular, high-throughput whole-genome sequencing data will allow for both a high-resolution analysis of interchromosomal rearrangements and yield insight into the interplay between sequence features, chromatin modifications, and 3D genomic structure.
We generated SCNA heatmaps from the data of Beroukhim et al.1 who reported a total of 75,700 amplification and 55,101 deletion events across 3,131 cancer specimens; reported events are those with inferred copy number changes >.1 or <!.1, due to experimental limitations. We restricted our analysis to intra-arm SCNAs which do not start/end near telomeric/centromeric regions separated by more than one megabase bin, giving a set of 39,568 SCNAs (26,022 amplifications and 13,546 deletions). We note that SCNAs starting/ending in centromeres/telomeres (which include full-arm gain/loss) display a very different pattern of occurrence from other focal SCNAs, particularly in terms of their length distribution, which may indicate a different mutational mechanism. Requiring a separation of greater than one megabase bin is due to resolution limits of both SCNA and HiC data (see Supplementary Fig. 1 for details). SCNA matrices are constructed by counting the number of amplifications or deletions starting at Mb i and ending at Mb j of the same chromosomes. Similarly, HiC heatmaps were generated by counting the number of reported interactions2 between Mb i and j of the same chromosome in human cell line GM06690.
To test the respective contributions of mutational and selective forces on the distribution of SCNAs, we consider the probability of observing at SCNA that starts and end at i and j
as the product of the probability of a mutation, i.e. an SCNA to occur in a single cell μij, and the probability to have this mutation fixed in the population of cancer cells π(L), where L = |i − j is the SCNA length. The mutation probability μij depends on the model that describes the process leading to chromosomal alterations: (Uniform) two ends of an alteration are drawn randomly from the same chromosomal arm, giving ; (HiC) the probability of an alteration depends on the probability of a 3D contact between the ends as given by HiC data, ; (FG) the probability of alteration depends upon the probability of 3D contact according to the fractal globule model, i.e. on SCNA length L: . The probability of fixation depends on the fitness of a mutated cell as compared to non-mutated cells (see below). Each mutational model is considered by itself and in combination with purifying selection, giving six models: Uniform, HiC, FG, Uniform+sel, HiC+sel, and FG+sel. For example, . The additional parameter describing selection is accounted for using BIC (described below).
We also examined a mutational model which combines the effects of chromosomal compartments as determined by HiC2 with the FG model (FG+domains). Domains are brought into our models by assuming different likelihoods of SCNA ends to be located active-active, active-inactive and inactive-inactive domains (two independent parameters). This domain structure is then multiplied by the fractal globule contact probability, , where Dij = 1 if i and j are in different domains, Dij = κ if i and j are both in an open domain, and Dij = ν if i and j are both in a closed domain. We exclude chromosomes 4, 5, and 15 for the domain analysis, as these chromosomes have a poor correspondence between HiC domains (as determined in the original analysis of HiC2) and the HiC contact map.
Two major selective forces act on SCNAs: positive selection on SCNAs that amplify an oncogene or delete a tumor suppressor, and purifying selection that acts on all alterations. Purifying selection results from the deleterious effects of an SCNA that deletes or amplifies genes and regulatory regions of the genome that are not related to tumor progression. We assume that deleterious effect of an SCNA, and the resulting reduction in cells fitness ΔF, is proportional to SCNA length: |ΔF| L.
where ΔF is a relative fitness difference (selection coefficient), N is the effective population size. For weakly deleterious mutations (ΔF < 0, N|ΔF| 1, |ΔF| 1)
Note that for sufficiently deleterious mutations this leads to an exponentially suppressed probability of fixation: π(ΔF) exp(ΔFN) (ΔF < 0), a useful intuitive notion. Assuming a deleterious effect linear in SCNA length, ΔF = −L/λ, we obtain the probability of fixation for purifying selection acting on an SCNA
where C is an arbitrary constant obtained from normalization of P(L), and α = λ/N is a fitting parameter which quantifies the strength of purifying selection. For gene-based purifying selection, L is simply replaced by the number of genes altered. Mutations that are selectively neutral have no length dependence, so π(L) = C, and thus Pij ~ μij.
Positive selection acting on cancer-associated genes (eg. oncogenes and tumor suppressors) presents a possible confounding factor to our analysis. To establish that our results were robust to positive selection acting on cancer-associated genes, we analyzed the subset of the 39,568 SCNAs (26,022 amplifications and 13,546 deletions) that do not span highly-recurrent SCNA regions identified by GISTIC with a false-discovery rate q-value for alteration of <.25 as listed in Beroukhim et al.1, a collection of 24,310 SCNAs (16,521 amplifications and 7,789 deletions, respectively 63% and 58% of the full set). After SCNAs spanning highly-recurrent regions are removed, permutations are performed under the constraint that permuted SCNAs do not cross any of the highly-recurrent regions. Positive selection can also be somewhat controlled for by setting a threshold on the inferred change in copy number, to filter SCNAs that may have experienced strong positive selection in individual cancers. We note that our findings are robust to the subset of chosen SCNAs, most likely because there are many fewer driver SCNAs than passenger SCNAs (Supplementary Fig. 7).
Since the occurrence of a particular SCNA starting at i and ending at j is a rare event, we evaluate the relative ability of a model to predict the observed distribution of SCNA by calculating the Poisson Log-likelihood of the data given the model:
where is dictated by the model as explained above, and SCNAij is the number of SCNAs that start and end at i and j. Since recurrent regions of amplification and deletion are different, we calculate the log-likelihood separately for amplifications and deletions, and then aggregate across these two classes of SCNAs. After the log-likelihood is calculated, models are ranked and model selection is performed using Bayesian Information Criterion (BIC). BIC penalizes models based upon their complexity, namely their number of parameters. Penalizing k additional parameters for n observed SCNAs using Bayesian Information Criterion (BIC) is straightforward:
where models with higher BIC are preferred28. For the permutation analysis, log-likelihood is calculated in the same way, first for the observed SCNAs, and then for permuted sets of SCNAs.
We thank members of the Mirny Lab for helpful conversations, in particular with Christopher McFarland regarding purifying selection and Maxim Imakaev regarding fractal globules. We thank Craig Mermel for an introduction to SCNA data. We thank Vineeta Agarwala, Jesse Engrietz, Rachel McCord, and Job Dekker for helpful comments and suggestions. This work was supported by the NIH/NCI Physical Sciences Oncology Center at MIT (U54CA143874)