Sequencing of a SCLC cell line
Most small cell lung cancers are not surgically resected [7], meaning that cell lines are an indispensable resource for studying this disease. NCI-H209 is an immortal cell line derived from a bone marrow metastasis of a 55-year-old male with SCLC, taken before chemotherapy [16]. The smoking history of the patient is not recorded [16]. However, the specimen showed histologically typical small cells with classic neuroendocrine features: >97% of such tumours are associated with tobacco smoking [17,18]. An EBV-transformed lymphoblastoid line, NCI-BL209, has been generated from the patient. NCI-H209 has been extensively characterised by spectral karyotyping, capillary sequencing and high-resolution copy-number array (http://www.sanger.ac.uk/genetics/CGP/cosmic/).
Using the SOLiD platform, we generated 25 bp short-read, mate-pair shotgun sequences from the tumour and matched normal genomes. Based on detailed power calculations, we estimated that tumour and normal genomes should be sequenced to 30-fold depth to identify somatically acquired genetic variants with high sensitivity and distinguish them from both sequencing errors and germline polymorphisms. In total, 112 Gb (39x coverage) from the tumour and 90 Gb (31x) from the normal were aligned to the reference genome.
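The relationship between aligned sequence and fold coverage quoted above can be checked with a short calculation. This is a minimal sketch: the haploid reference size of 2.86 Gb is an assumption chosen for illustration (the text does not state the denominator used), and `fold_coverage` is a hypothetical helper name.

```python
def fold_coverage(aligned_bases: float, genome_size: float) -> float:
    """Mean fold coverage = aligned bases / haploid genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return aligned_bases / genome_size


# ASSUMPTION: approximate haploid reference size in bases; not given in the text.
ASSUMED_GENOME_SIZE = 2.86e9

tumour_cov = fold_coverage(112e9, ASSUMED_GENOME_SIZE)  # 112 Gb aligned
normal_cov = fold_coverage(90e9, ASSUMED_GENOME_SIZE)   # 90 Gb aligned

print(f"tumour: {tumour_cov:.0f}x, normal: {normal_cov:.0f}x")
```

Under that assumed genome size, both values round to the reported 39x and 31x, each above the 30-fold target from the power calculations.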
Figure 1 The compendium of somatic mutations in a small cell lung cancer genome. (A) Power calculations showing the number of true somatic substitutions detected (blue) and mis-calls (SNPs called as somatic mutations, burgundy, and sequencing errors called as ...
Bioinformatic algorithms were developed to identify somatically acquired genetic variation from the sequencing data (supplementary figure 1, supplementary tables 1-5), and their calls were subjected to rigorous validation by PCR and capillary sequencing. We had previously identified 29 base substitutions, of which 22 (76%) were called by our algorithm from the SOLiD sequencing data (supplementary results; supplementary table 6). A further 79 novel coding substitutions and 354 randomly chosen genome-wide variants called by the algorithm were also tested. Of these, 77 (97%) of the coding substitutions and 333 (94%) of the random variants were confirmed as genuine somatic mutations (supplementary table 7). Neither of two known indels in coding sequence was identified. Among putative somatic indels that were called, the true-positive rate by capillary sequencing was 25% (supplementary results; supplementary table 8). Therefore, only somatic indels which were confirmed by capillary sequencing are reported here. All somatic genomic rearrangements called by anomalous read-pairs were validated by PCR and capillary sequencing across the breakpoint, as previously described [14].
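The validation percentages above follow directly from the reported counts. The sketch below recomputes them and, as an addition that is not part of the original analysis, attaches a Wilson 95% confidence interval to each proportion to indicate how much uncertainty samples of this size carry. The function name `wilson_interval` and the dictionary labels are illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts taken from the text: (confirmed or recovered, tested).
validation = {
    "known substitutions recovered": (22, 29),
    "novel coding substitutions confirmed": (77, 79),
    "random genome-wide variants confirmed": (333, 354),
}

for label, (k, n) in validation.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```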
Repertoire of somatic mutation
Overall, 22,910 somatically acquired substitutions were identified across the NCI-H209 genome, and a further 65 indels, 334 copy number segments and 58 structural variants were confirmed (supplementary tables 1-5).
Somatically acquired genomic variants of all classes in a SCLC genome.
For point mutations in coding regions, we found the previously described RB1 C706F mutation, known to abrogate protein function [19], and the mutation that disrupts a splice site in TP53. Combined loss of RB1 and TP53 is a characteristic feature of SCLC, confirming that NCI-H209 is genetically typical of this disease. One G>T transversion generated a premature stop codon in MLL2. We have observed clustering of truncating mutations in this gene, a histone methyltransferase, in renal cancer (manuscript submitted). Of the coding variants, 92 are predicted to change amino acids, and 36 are synonymous. Since cancer is a clonal disease in which the phenotypic consequences of mutation are subject to Darwinian natural selection, accumulation of mutations conferring selective advantage on cancer subclones will manifest as an excess of non-synonymous mutations. However, the observed non-synonymous:synonymous ratio of 2.56:1 is not significantly different from that expected by chance (p=0.3), suggesting that the majority of coding variants do not confer a selective advantage to the cancer.
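The reasoning behind the non-synonymous:synonymous comparison can be illustrated as follows. The observed ratio comes straight from the counts in the text (92 and 36). The null expectation used by the authors is not given here, so `ASSUMED_NULL_FRACTION` is a purely hypothetical placeholder, and the exact binomial test is one simple way to frame the comparison rather than the method actually used. The resulting p-value is therefore not expected to reproduce the reported p=0.3.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k: int, n: int, p_null: float) -> float:
    """Exact two-sided binomial test: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    pmfs = [binom_pmf(i, n, p_null) for i in range(n + 1)]
    observed = pmfs[k]
    return min(1.0, sum(x for x in pmfs if x <= observed * (1 + 1e-9)))


non_synonymous, synonymous = 92, 36  # counts from the text
ns_ratio = non_synonymous / synonymous
print(f"observed N:S ratio = {ns_ratio:.2f}:1")

# ASSUMPTION: hypothetical fraction of random coding substitutions that are
# non-synonymous; the real value depends on codon usage and mutation spectrum.
ASSUMED_NULL_FRACTION = 0.75

p_value = binom_two_sided_p(
    non_synonymous, non_synonymous + synonymous, ASSUMED_NULL_FRACTION
)
print(f"two-sided p under the assumed null = {p_value:.2f}")
```

An excess of non-synonymous changes relative to the null would point to positive selection; a ratio close to the null, as reported, is what a predominantly passenger mutation set would produce.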
Due to the limited throughput of capillary sequencing, there has previously been little attempt to explore regulatory regions of the genome for potential oncogenic mutations. To address this, we extracted somatic substitutions occurring within 2kb either side of known transcription start sites, which would generally include gene promoters. Mutations were evenly distributed across the 4kb regions (supplementary figure 2A). We applied hidden Markov models to predict which substitutions might affect transcription factor binding sites. The distribution observed was no different to that seen in random, simulated sets of ‘mutations’ (supplementary figure 2B), suggesting that, analogous to substitutions in coding sequence, most of those found in regulatory regions are selectively neutral to the cancer. Nonetheless, as with coding mutations, there may be a small number which alter transcription factor binding and affect gene regulation, thus providing phenotypic variation for selection to act upon. For example, a T>G mutation 49bp upstream of the transcription start site of a gene in the RAS oncogene family, RAB42, is predicted to have significant disruptive effects on a potential binding motif for the RAS-responsive RREB1 transcription factor (p=3×10−98; supplementary figure 2C).
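The extraction step described above, selecting substitutions within 2kb either side of a transcription start site (TSS), might be sketched as below. The data layout, the function name `substitutions_near_tss`, and all coordinates and gene labels in the example are hypothetical; the offset is reported relative to the direction of transcription, so negative values are upstream of the TSS.

```python
from bisect import bisect_left
from collections import defaultdict

WINDOW = 2000  # 2kb either side of the TSS, as described in the text


def substitutions_near_tss(substitutions, tss_sites, window=WINDOW):
    """Return (chrom, pos, gene, offset) for each substitution within
    `window` bp of a TSS. offset < 0 means upstream, taking strand into account.

    substitutions: iterable of (chrom, pos)
    tss_sites:     iterable of (chrom, pos, strand, gene), strand is "+" or "-"
    """
    by_chrom = defaultdict(list)
    for chrom, pos, strand, gene in tss_sites:
        by_chrom[chrom].append((pos, strand, gene))
    for sites in by_chrom.values():
        sites.sort()

    hits = []
    for chrom, pos in substitutions:
        sites = by_chrom.get(chrom, [])
        # First TSS whose position is >= pos - window.
        start = bisect_left(sites, (pos - window,))
        for tss_pos, strand, gene in sites[start:]:
            if tss_pos > pos + window:
                break
            offset = pos - tss_pos if strand == "+" else tss_pos - pos
            hits.append((chrom, pos, gene, offset))
    return hits


# Hypothetical coordinates, for illustration only.
tss = [("chr1", 10_000, "+", "GENE_A"), ("chr1", 50_000, "-", "GENE_B")]
subs = [("chr1", 9_951), ("chr1", 30_000), ("chr1", 50_049), ("chr2", 100)]

for hit in substitutions_near_tss(subs, tss):
    print(hit)
```

In this toy input, the first and third substitutions each lie 49bp upstream of a TSS on opposite strands, mirroring the position of the RAB42 example, while the other two fall outside every window.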
Taken together, these data suggest that the majority of mutations in coding and promoter regions of the NCI-H209 genome are passenger events, conferring no selective advantage to the cells. Ranking algorithms can be useful to prioritise variants for further study, but the key evidence for identifying driver mutations is recurrence in independent tumour samples, supplemented by functional studies.
Multiple mutation signatures in NCI-H209
Tobacco smoke contains more than 60 carcinogens which bind and chemically modify DNA, characteristically forming bulky adducts at purine bases (guanine and adenine) [3]. Adducts distort the DNA helix and, if not corrected by nucleotide excision repair (NER) or other pathways, allow non-Watson-Crick pairing during DNA replication. The physicochemical properties of the mutagen determine which adduct is formed, what repair mechanism is induced and which mis-pairing is permissible [3]. The substantial mutational load carried in the NCI-H209 genome allows us to discern with great statistical power several distinct mutation signatures, genomic records of the medley of mutagens deposited in the airways and lungs by tobacco smoking.
G>T/C>A transversions were the commonest change observed (34%), followed by G>A/C>T (21%) and A>G/T>C (19%) transitions. This distribution is remarkably similar to the pattern of substitutions observed in TP53 in SCLC cases curated from the published literature (supplementary figure 3). This implies firstly that the NCI-H209 genome is typical of SCLC, and therefore of tobacco-associated mutational profiles, and secondly that the majority of mutations were acquired in vivo, not during cell culture. G>T transversions caused by polycyclic aromatic hydrocarbons occur more frequently at methylated CpG dinucleotides in vitro and in TP53 [20,21]. To explore this genome-wide, we compared the base preceding G>T mutations with the base before wild-type guanines in NCI-H209. CpG dinucleotides were significantly enriched amongst the G>T mutation set compared to controls (odds ratio (OR), 1.5; 95% CI, 1.3-1.6; p<0.0001). We can use the fact that only 10-20% of CpG dinucleotides in CpG islands are constitutively methylated compared to 60-70% outside of CpG islands [22] to assess how cytosine-methylation affects mutations at the neighbouring guanine. G>T mutations at CpG dinucleotides were significantly more likely to be found outside CpG islands than expected by chance (OR, 1.8; 95% CI, 1.1-2.8; p=0.02), suggesting that these transversions do indeed preferentially occur at methylated CpGs.
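The enrichment statistics above are odds ratios from 2x2 comparisons, such as mutated versus wild-type guanines, split by whether the preceding base is a cytosine. A minimal sketch of that calculation is given below, using the standard Wald interval on the log odds ratio. The counts are invented solely so that the arithmetic is easy to follow (they are not the study's data), and the authors' exact statistical procedure is not specified in this section.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald confidence interval for the 2x2 table

                      in CpG context   not in CpG context
        mutated G           a                  b
        wild-type G         c                  d
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# HYPOTHETICAL counts chosen for illustration; not taken from the paper.
or_, lo, hi = odds_ratio_ci(a=300, b=7_000, c=2_000, d=70_000)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An interval that excludes 1, as in this toy table, corresponds to a significant enrichment of the CpG context among mutated guanines. The same function applies to the inside versus outside CpG island comparison by relabelling the columns.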
Figure 2 The mutation profile of NCI-H209. (A) Numbers of mutations in each of the 6 possible mutation classes. (B) Fraction of the three classes of guanine mutations occurring at CpG dinucleotides in NCI-H209, with p values reflecting the comparison with the ...
We next assessed the base preceding the guanine for G>A and G>C mutations. For G>A transitions, striking enrichment of CpG dinucleotides was observed in the mutation set compared to wild-type guanines in the genome (OR, 4.0; 95% CI, 3.7-4.3; p<0.0001), and these showed a strong propensity to occur outside CpG islands (OR, 2.6; 95% CI, 1.6-4.1; p<0.0001). This is consistent with the well-described phenomenon of spontaneous deamination of methylated cytosine to thymine. Although G>C transversions showed a similar enrichment for CpG context (OR, 2.2; 95% CI, 1.9-2.5; p<0.0001), these were significantly more likely to occur within CpG islands (OR, 0.6; 95% CI, 0.4-1.0; p=0.05), suggesting that the carcinogen responsible targets unmethylated CpG dinucleotides. In keeping with previous reports [23,24], we found that the guanine base in G>C transversions was more frequently followed by an adenine than expected by chance (OR, 1.4; 95% CI, 1.3-1.5; p<0.0001).
For mutations involving adenines, fewer substitutions of all classes were seen at GpA dinucleotides than expected by chance (p<0.0001), and A>T and A>G occurred significantly more frequently at TpA than expected (p<0.0001). Among somatically acquired indels, single base-pair insertions were more likely to be gains of A or T nucleotides than C or G (8:1). Curiously, single base deletions favoured loss of C/G nucleotides, rather than A/T (26:12), and there was a propensity for the C/G deletions to occur at CC or GG dimers or longer (18/26). In contrast to the frequency of indels at runs of A or T nucleotides, deletions at C or G tracts are not well described, and our findings may reflect a distinct mutation signature.
Thus, the sequence context of the ~23,000 mutations in the NCI-H209 genome provides tremendous power to identify multiple distinctive mutation signatures, not evident from targeted resequencing studies of limited genomic regions.
Imprint of two DNA repair pathways
Several pathways can repair DNA lesions caused by exogenous carcinogens. Bulky adducts on purines are the predominant form of DNA damage induced by tobacco carcinogens, and can be sufficiently disruptive to impede RNA polymerase when they occur on the transcribed strand of genes. Stalled RNA polymerases can recruit the nucleotide excision repair machinery, leading to excision of the altered nucleotide, preventing mutation. In studies of TP53 mutations in lung cancer, G>T transversions occur more frequently on the non-transcribed strand [2,5], suggesting that many of the same lesions occurring on the transcribed strand are correctly identified and removed by the cell. We found that guanine and adenine substitutions are generally less frequent on the transcribed than the non-transcribed strand (supplementary figure 4), confirming that purines appear to be the major target of carcinogens in tobacco smoke.
We next correlated mutation prevalence to gene expression. For a given level of gene expression, the effects of transcription-coupled repair are revealed by the significant separation of curves for mutations on the transcribed and non-transcribed strands. We found evidence for significant transcription-coupled repair for G>T transversions (p<0.0001), as well as A>G (p=0.003) and A>T (p=0.03), possibly G>C (p=0.08), but not G>A (p=0.3) or A>C (p=0.8) mutations. Thus, the extent of transcription-coupled repair differs for the various classes of mutation, presumably reflecting differences in the ability of the transcription-coupled repair machinery to recognise and/or repair different adduct lesions.
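One simple way to picture this comparison is to group genes into expression bins and compute mutation prevalence per megabase separately for each strand. The sketch below does that; it is a simplified illustration under assumed inputs, not the statistical model behind the reported p values, and the record fields, function name and toy numbers are all hypothetical.

```python
def prevalence_by_expression(genes, n_bins=4):
    """Group genes into equal-sized bins of increasing expression and return,
    per bin, mutations per Mb on the transcribed and non-transcribed strands.

    genes: list of dicts with keys
        "expression", "length_bp", "transcribed", "non_transcribed"
    """
    if not genes or n_bins <= 0:
        return []
    ranked = sorted(genes, key=lambda g: g["expression"])
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    rows = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        megabases = sum(g["length_bp"] for g in chunk) / 1e6
        rows.append({
            "mean_expression": sum(g["expression"] for g in chunk) / len(chunk),
            "transcribed_per_mb": sum(g["transcribed"] for g in chunk) / megabases,
            "non_transcribed_per_mb": sum(g["non_transcribed"] for g in chunk) / megabases,
        })
    return rows


# Toy data for illustration only.
toy_genes = [
    {"expression": 1, "length_bp": 1_000_000, "transcribed": 10, "non_transcribed": 12},
    {"expression": 2, "length_bp": 1_000_000, "transcribed": 8, "non_transcribed": 12},
    {"expression": 8, "length_bp": 1_000_000, "transcribed": 4, "non_transcribed": 8},
    {"expression": 10, "length_bp": 1_000_000, "transcribed": 2, "non_transcribed": 8},
]

for row in prevalence_by_expression(toy_genes, n_bins=2):
    print(row)
```

In this layout, a gap between the two strand columns within a bin is the pattern attributed to transcription-coupled repair, whereas a decline in both columns as expression rises is the pattern expected from a repair process acting on both strands.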
For most mutations, there appears to be another novel expression-linked repair pathway that operates on both strands and is at least as numerically important as transcription-coupled repair. Thus, significantly lower mutation prevalence, on both transcribed and non-transcribed strands, was observed in more highly expressed genes for G>T (p<0.0001), G>A (p<0.0001), G>C (p<0.0001) and A>T (p<0.0001). Again, there are some interesting differences across mutation classes in the relative contributions of the two repair pathways. For A>G mutations, only transcribed strand mutations decreased with higher gene expression, suggesting that transcription-coupled repair is the more important pathway for preventing such events. In contrast, G>A mutations occurred equally on transcribed and non-transcribed strands, but mutations on both strands were significantly reduced in more highly expressed genes, suggesting that the novel expression-linked repair pathway is more important than transcription-coupled repair here.
Taken together, these data imply that at least two separate DNA repair pathways have been enlisted for protection of the NCI-H209 genome, notwithstanding the difficulties in extrapolating cell line expression levels to in vivo expression during cancer progression. The fact that the two pathways have operated with differing efficacy across the six classes of mutation implies that the lesions have distinct physicochemical effects on DNA structure, with variable recognition and excision by the genome surveillance machinery.
Genomic rearrangements and copy number
We identified 58 somatically acquired genomic rearrangements in the NCI-H209 genome. These include 18 (31%) deletions and 9 (16%) tandem duplications. The majority of rearrangements, however, cannot be ascribed to classical structural variant patterns, due to the considerably greater complexity of somatically acquired rearrangements compared to germline events. This is exemplified by a set of rearrangements incorporating regions from chromosomes 1p32-36 and 4q25-28. Here, most of the intrachromosomal rearrangements are in inverted orientation, but cannot be classical inversions since they demarcate copy number changes and do not have reciprocal breakpoints. By similar reasoning, most interchromosomal rearrangements also appear to be unbalanced. Other clusters of unbalanced rearrangements were found in NCI-H209, including chromosomes 3q and 5q, and we have seen this phenomenon in many other solid tumour genomes.
Figure 3 Localised complexes of somatically acquired genomic rearrangements in NCI-H209. Copy number plots across regions on chromosomes 1 and 4 are shown. Inverted intrachromosomal rearrangements (blue), non-inverted intrachromosomal rearrangements (brown) and ...
Chromosomal rearrangements can juxtapose two genes: if they are in the same orientation with an intact open reading frame, an oncogenic fusion gene may result. In NCI-H209, a predicted in-frame fusion gene was created by a 240kb deletion on chr16, adjoining the first two exons of CREBBP with the 3′ portion of BTBD12, a gene involved in repair of dsDNA breaks [25,26]. Interestingly, in acute myeloid leukaemia, CREBBP is recurrently fused with MYST3 [27]. RT-PCR showed that the predicted CREBBP-BTBD12 fusion transcript is expressed in NCI-H209, but not in 55 other SCLC cell lines. The significance of the predicted fusion gene with respect to cancer development is therefore unclear.
CHD7 rearrangements in SCLC cell lines
Intrachromosomal rearrangements can also result in internal rearrangements of genes, through loss or duplication of exons. A 39kb tandem duplication was found in CHD7, predicted to lead to in-frame duplication of exons 3-8. We previously identified a massively amplified and highly expressed fusion gene comprising exons 1-3 of PVT1, a non-coding RNA gene immediately downstream of MYC, and exons 4-38 of CHD7 in another SCLC cell line, NCI-H2171 [14]. This raises the possibility that CHD7 rearrangements may be recurrent in SCLC. Using multiplex ligation-dependent probe amplification, we identified a further SCLC cell line (LU-135) with internal exon copy number alterations, among 63 lines screened (supplementary figure 5). LU-135 was therefore studied by mate-pair sequencing. This demonstrated that, as for NCI-H2171, the CHD7 amplicon was linked to MYC amplification. One breakpoint predicted the existence of a fusion gene between exon 1 of PVT1 and exons 14-38 of CHD7, and by RT-PCR across the breakpoint, this transcript is expressed. In keeping with genomic amplification and active expression of the PVT1 locus, NCI-H2171 and LU-135 show particularly elevated levels of CHD7 transcripts. SCLC cell lines on average show a log2 greater expression of CHD7 than both non-small cell lung cancer lines and other tumour types (p<0.0001).
Figure 4 CHD7 rearrangements in SCLC cell lines. (A) A somatically acquired 39.5kb tandem duplication is found in NCI-H209. (B) The LU-135 cell line shows co-amplification of the 3′ portion of CHD7 together with MYC and the 5′ portion of PVT1. ...
CHD7 is rearranged in three SCLC cell lines. Two carry a PVT1-CHD7 fusion gene in the setting of MYC amplification. PVT1 is a non-coding gene immediately downstream of MYC, and may itself be a transcriptional target of the MYC protein [28]. Insertion of CHD7 into this locus with subsequent amplification gives the double hit of increased gene copy number and regulatory elements for a co-amplified transcription factor, explaining the massive over-expression seen in these cell lines. PVT1 is recurrently rearranged in variant Burkitt lymphoma translocations [29], and may be oncogenic [30]. The NCI-H209 rearrangement is predicted to duplicate one of the two chromodomains. CHD7 is a chromatin remodeller, promoting enhancer-mediated transcription through association with histone H3K4-methylation [31]. Histone modifiers have been implicated as cancer genes [32], and a family member, CHD5, may function as a tumour suppressor gene [33]. Recurrent rearrangements of CHD7 in SCLC would be an interesting extension of this theme if functional studies and genomic analyses of primary samples confirm our data.