The use of next-generation sequencing technologies to produce genomic copy number data has recently been described. Most approaches, however, rely on optimal starting DNA, and are therefore unsuitable for the analysis of formalin-fixed paraffin-embedded (FFPE) samples, which largely precludes the analysis of many tumour series. We have sought to challenge the limits of this technique with regard to quality and quantity of starting material and the depth of sequencing required. We confirm that the technique can be used to interrogate DNA from cell lines, fresh frozen material and FFPE samples to assess copy number variation. We show that as little as 5 ng of DNA is needed to generate a copy number karyogram, and follow this up with data from a series of FFPE biopsies and surgical samples. We have used various levels of sample multiplexing to demonstrate the adjustable resolution of the methodology, depending on the number of samples and available resources. We also demonstrate reproducibility by use of replicate samples and comparison with microarray-based comparative genomic hybridization (aCGH) and digital PCR. This technique can be valuable in both the analysis of routine diagnostic samples and in examining large repositories of fixed archival material.
Systems medicine is expected to link complex molecular data underlying disease phenotypes to patient outcomes (1). As technology for data generation, such as next-generation (NG) sequencing, and methodology for computational analysis develop rapidly, the accessibility of clinical material of sufficient quality and quantity is likely to be rate-limiting in the discovery process. For cancer patients, diagnosis is virtually always based on the evaluation of a tumour sample, ideally a biopsy or surgical specimen. Left-over tissue, stored in pathology archives and no longer required following diagnosis or clinical management, provides an immense resource for molecular analysis, as each cancer patient is a potential donor. The fixation procedures, necessary for tissue preservation in the diagnostic setting, may however compromise the quality of the material with regard to some applications. One of the applications described for NG sequencing is the production of copy number variation (CNV) data by analysing the distribution of aligned reads to a reference genome (2–5). These initial studies have used a depth of sequencing that would preclude the routine analysis of large numbers of samples. They have also used optimal starting conditions in terms of DNA quality and quantity, which are not likely with clinical samples. However, Schweiger et al. (6) recently showed that DNA isolated from formalin-fixed paraffin-embedded (FFPE) tumours could be used for sequencing-based CNV analysis. Here, we report how we have modified the current methods for sequencing-based CNV analysis and overcome their limitations, to allow the high-throughput copy number analysis of DNA from suboptimal quality material, such as FFPE tissue. Specifically, we have investigated the feasibility of undertaking sequence analysis of samples in pools using unique oligonucleotide tags to distinguish individual patients.
We demonstrate that up to 10 patient samples can be pooled in one sequencing lane of an Illumina Genome Analyzer II (GAII; Illumina Inc, San Diego, CA, USA). We also show that as little as 5 ng (compared to Illumina’s recommended 1 µg) of template DNA from FFPE specimens is sufficient to generate a library of fragments for sequence analysis, resulting in a copy number karyogram that is indistinguishable from a karyogram generated from 1 µg of template DNA or from DNA isolated from the same fresh-frozen tumours without prior fixation. These modifications will enable exploitation of the vast archives of FFPE material to supply the systems medicine research pipeline, and will bring this methodology within the reach of routine clinical analysis, while preserving the majority of the tissue block for other investigations.
To illustrate the value of this approach, we have obtained FFPE samples of repeat biopsies and resection specimens from the same patient whose oral cancer recurred over a 2-year period. We show that all samples share certain copy number aberrations indicating a likely common progenitor, but that the most recent samples display a new region of chromosomal amplification indicating that an initial clonal population of tumour cells has continued to mutate.
The LUDLU-1 cell line was established from a lung squamous cell carcinoma (7). The AGLCL cell line was established following EBV infection of normal B cells from the same patient. Both cell lines were cultured in RPMI 1640 medium supplemented with 2 mM glutamine, 50 U/ml penicillin, 50 µg/ml streptomycin and 10% foetal bovine serum using standard cell culture techniques. The HONE-1 cell line is an epithelial cell line derived from a poorly differentiated nasopharyngeal squamous cell carcinoma (8). It was cultured in Iscove’s modified Dulbecco’s medium, supplemented with 2 mM glutamine, 50 U/ml penicillin, 50 µg/ml streptomycin and 10% foetal bovine serum using standard cell culture techniques.
Surgical resection specimens of lung tumours and corresponding normal lung tissue were prospectively collected at the local Department of Thoracic Surgery, snap frozen and stored in aliquots at −80°C. FFPE blocks of lung tumours, oral tumours and dysplastic lesions were retrieved from the local pathology archive. Approval was obtained from the local ethics committee and written informed consent for the use of their tissue for research was available for all patients.
Genomic DNA from clinical samples and cell lines was prepared by sequential phenol/chloroform extraction followed by ethanol precipitation, adapted from a previously described method (9). Briefly, frozen tissue was ground using a pestle and mortar and the resulting powder suspended in 0.4% LiDS lysis buffer; cell lines were harvested from the culture flask, pelleted by centrifugation and resuspended in 0.4% LiDS lysis buffer. Proteinase K was then added to the cell lysate to a final concentration of 100 μg/ml and incubated overnight at 55°C. Following three phenol extractions and one with chloroform, 0.3 M sodium acetate pH 5.2 was added to the final aqueous solution (1:10 vol:vol) and DNA was precipitated by the addition of two volumes of 95% ethanol. The DNA precipitate was then collected by centrifugation, washed with 70% ethanol, air dried and finally dissolved in sterile water.
Areas of dysplasia or tumour to be dissected were identified and marked by a pathologist on a haematoxylin and eosin (H+E) stained slide cut from each FFPE block to be sampled. Ten 7 μm sections were then cut and the dysplastic or tumour tissue was macro-dissected with a scalpel blade using the marked H+E slide as a guide. A further section was cut and then H+E stained to confirm persistence of histology throughout the areas sampled. DNA extraction was performed using the QIAamp DNA micro kit (Qiagen, Sussex, UK) according to the manufacturer’s instructions and DNA was eluted in 25 μl of AE Buffer.
DNA concentration and purity were determined using the Nanodrop-8000 (Fisher Scientific UK Ltd, Leicestershire, UK) and the Quant-iT PicoGreen dsDNA BR assay (Invitrogen, Paisley, UK).
DNA from the LUDLU-1 and AGLCL cell lines (450 ng each) was labelled in the presence of Cy3- or Cy5-labelled nucleotides as previously described (10) but purified using a PureLink PCR Purification Kit (Invitrogen) according to the manufacturer’s instructions. Labelled material was combined, hybridized to Agilent Human Genome 244K CGH microarrays (Agilent Technologies, Santa Clara, CA, USA), washed and scanned according to Agilent’s Oligonucleotide Array-based CGH for Genomic DNA Analysis protocol. Data were extracted from the scanned images using Feature Extraction software (v7.1; Agilent Technologies) and analysed using Agilent’s DNA Analytics package.
Between 5 ng and 1 μg of genomic DNA was used to prepare the DNA libraries for sequencing, following standard Illumina protocols. DNA was sheared on a Covaris S2 Sample Preparation System (Covaris Inc., Woburn, MA, USA) and checked for appropriate size distribution on an Agilent Bioanalyser DNA 1000 LabChip. End repair was performed by using the End-It DNA End Repair Kit (Epicentre Biotechnologies, Madison, WI, USA) and Klenow DNA polymerase, followed by ligation of 6 bp unique tag adapter oligonucleotides, using previously established methods (11). Tags were chosen so as to avoid over-representation of any one base at each position, which can interfere with cluster recognition. Fragments were size selected to 200 bp using a gel cut step. Samples were enriched using a 12-cycle enrichment PCR. For low concentration DNA samples, an 18-cycle enrichment PCR was performed before the gel cut stage rather than afterwards. Libraries were then examined using an Agilent Bioanalyser DNA 1000 LabChip and Invitrogen’s Quant-iT PicoGreen dsDNA BR assay kit to assess DNA quality and concentration, respectively. This information was used to pool equal amounts of each sample library for cluster amplification and either 51 or 76 cycles of Illumina sequencing by synthesis, resulting in 45/70 bp of genomic DNA sequence and 6 bp of tagged adapter. Sequencing was initially done with 51-bp reads, but the move was made to 76-bp reads as machine and analysis package upgrades resulted in better base calling for longer reads.
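The requirement that no base be over-represented at any tag position can be checked programmatically. The following is a minimal sketch only: the function names, the example tag sequences and the 50% threshold are all illustrative assumptions and are not taken from the protocol used here.

```python
from collections import Counter


def position_base_fractions(tags):
    """For each tag position, return the fraction of tags carrying A, C, G and T."""
    if not tags:
        raise ValueError("need at least one tag")
    length = len(tags[0])
    if any(len(tag) != length for tag in tags):
        raise ValueError("all tags must have the same length")
    fractions = []
    for pos in range(length):
        counts = Counter(tag[pos] for tag in tags)
        fractions.append({base: counts.get(base, 0) / len(tags) for base in "ACGT"})
    return fractions


def is_balanced(tags, max_fraction=0.5):
    """True if no single base exceeds max_fraction at any position.

    The 0.5 default is an arbitrary illustrative threshold, not a published value.
    """
    return all(max(f.values()) <= max_fraction for f in position_base_fractions(tags))


if __name__ == "__main__":
    # Hypothetical 6 bp tags; each position carries every base exactly once.
    example_tags = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
    print(is_balanced(example_tags))
```

A tag set that fails such a check would be a candidate for redesign before pooling, since the imbalance affects every cluster in the lane.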
Image analysis and base calling were performed using Illumina’s CASAVA pipeline. Reads were trimmed of their 6 bp tags with the USE_BASES option and uniquely aligned to the human genome (UCSC version hg19) using the alignment algorithm Eland (12). Python scripts were used to first split the reads into files according to tag, and then to make pairwise comparisons of each tumour and normal sample by splitting the genome into non-overlapping windows of equal numbers of normal reads, typically 400, and counting the number of tumour reads which fell into each window.
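The windowing step can be illustrated with a short sketch. This is not the authors' script: the function names are hypothetical, reads are represented simply as alignment start positions on a single chromosome, and the handling of the final incomplete window (dropped here) is an assumption.

```python
from bisect import bisect_left


def build_windows(normal_positions, reads_per_window=400):
    """Split one chromosome into consecutive half-open windows [start, end),
    each holding reads_per_window normal reads.

    Trailing reads that do not fill a whole window are dropped (an assumption).
    Many reads sharing one start position could make counts slightly unequal.
    """
    if reads_per_window < 1:
        raise ValueError("reads_per_window must be positive")
    positions = sorted(normal_positions)
    windows = []
    for first in range(0, len(positions) - reads_per_window + 1, reads_per_window):
        nxt = first + reads_per_window
        # A window ends where the next one begins, or just past the last read.
        end = positions[nxt] if nxt < len(positions) else positions[-1] + 1
        windows.append((positions[first], end))
    return windows


def count_in_windows(tumour_positions, windows):
    """Count the tumour reads whose start position falls in each window."""
    positions = sorted(tumour_positions)
    return [
        bisect_left(positions, end) - bisect_left(positions, start)
        for start, end in windows
    ]


if __name__ == "__main__":
    normal = list(range(0, 1000, 10))          # 100 toy normal reads
    tumour = [5, 15, 260, 600, 610, 620, 990]  # toy tumour reads
    windows = build_windows(normal, reads_per_window=25)
    print(windows)
    print(count_in_windows(tumour, windows))
```

Because each window holds the same number of normal reads, window width in base pairs varies with local read density, so regions that align poorly in both samples are covered by wider windows rather than by noisier counts.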
Copy number analysis was done in R, first by normalizing the number of tumour and normal counts across the genome and calculating the log2 ratio of normalized tumour:normal read counts for each window. Second, segments of equal copy number were called using the Bioconductor DNAcopy package (13). These segments were converted into bedgraphs suitable for uploading onto the UCSC genome browser (http://genome.ucsc.edu/) (14) as well as being plotted over graphs of tumour read counts. A summary of the number of aligned reads and subsequently detected CNVs for each sample is listed in Table 1.
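The normalization and log2 ratio calculation can be sketched as below. The analysis itself was done in R; Python is used here purely for illustration. Scaling each sample by its total read count and adding a small pseudocount to avoid taking the logarithm of zero are both assumptions, since the exact normalization is not specified in the text. Segmentation with DNAcopy is not reproduced.

```python
import math


def log2_ratios(tumour_counts, normal_counts, pseudocount=0.5):
    """Per-window log2 ratio of normalized tumour:normal read counts.

    Each count is divided by its sample total (assumed normalization).
    The pseudocount is an illustrative guard against empty windows.
    """
    if len(tumour_counts) != len(normal_counts):
        raise ValueError("both samples must have the same number of windows")
    tumour_total = sum(tumour_counts)
    normal_total = sum(normal_counts)
    if tumour_total == 0 or normal_total == 0:
        raise ValueError("each sample needs at least one read")
    ratios = []
    for t, n in zip(tumour_counts, normal_counts):
        tumour_fraction = (t + pseudocount) / tumour_total
        normal_fraction = (n + pseudocount) / normal_total
        ratios.append(math.log2(tumour_fraction / normal_fraction))
    return ratios


if __name__ == "__main__":
    # Toy counts: the middle window holds twice as many tumour reads.
    print(log2_ratios([100, 200, 100], [100, 100, 100]))
```

With this scaling, a tumour sequenced to half the depth of its matched normal but with an otherwise identical profile gives ratios of zero in every window when no pseudocount is applied.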
A Pearson correlation was calculated between the array-CGH (aCGH) and sequencing data by calculating an average aCGH generated copy number for every genomic window of sequencing reads. This made it possible to perform a pairwise comparison of two files of equal length.
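A minimal sketch of this comparison follows, assuming that aCGH probes are given as (position, log2 ratio) pairs and sequencing windows as half-open (start, end) intervals on the same chromosome. The function names are hypothetical, and skipping windows that contain no probe is an assumption.

```python
import math
from bisect import bisect_left
from statistics import mean


def average_probes_per_window(probes, windows):
    """Average the aCGH values of all probes falling in each window.

    Returns None for a window without probes.
    """
    probes = sorted(probes)
    positions = [position for position, _ in probes]
    averaged = []
    for start, end in windows:
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, end)
        values = [value for _, value in probes[lo:hi]]
        averaged.append(mean(values) if values else None)
    return averaged


def pearson(xs, ys):
    """Pearson correlation over the pairs in which both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired values")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    windows = [(0, 100), (100, 200), (200, 300)]
    probes = [(10, 0.0), (50, 0.2), (150, 1.0), (250, -1.0), (260, -0.8)]
    sequencing = [0.2, 2.0, -1.8]  # toy per-window sequencing log2 ratios
    print(pearson(average_probes_per_window(probes, windows), sequencing))
```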
A similar method was used to calculate the correlation between molecular copy number counting (MCC) and sequencing data, sampling the sequence-based copy number for every MCC data point.
We began by analysing the copy number profile of genomic DNA from the lung squamous cell carcinoma cell line LUDLU-1 and the DNA from the paired normal B cell line AGLCL. We confirmed that the tumour karyogram demonstrated copy number gain and loss features predicted for a squamous cell lung carcinoma such as 3p loss and distal 3q and 8q amplification (15–17) (Supplementary Figure S1). We confirmed by comparing the normal DNA from five individuals that comparison of normal DNA against normal DNA resulted in a featureless karyogram with no detectable gain or loss (result not shown).
Further validation was achieved by comparing the copy number profile generated by NG sequencing with that obtained by aCGH. High-resolution aCGH was performed using an Agilent 244K array using DNA from LUDLU-1 and AGLCL cells. The aCGH-generated copy number profiles appeared almost identical to those obtained from sequence analysis. Every chromosome showed the same pattern of gain and loss. Even smaller features such as small spikes of gain or loss were replicated. A Pearson correlation of 0.9362277 was calculated between the two data sets. Examples from the aCGH profile compared to a copy number karyogram generated from NG sequencing data are shown in Figure 1.
To determine the validity of copy number annotation by sequencing even at high resolution, we compared its performance to another method that also generates copy number data in a digital format, MCC (18,19). We compared the copy number profile generated by NG sequencing with the MCC data for the 17 Mb amplicon in the distal part of chromosome 3q in the HONE-1 cell line and found the profiles to be very similar. Gains and losses were seen in the same places, although the MCC data were noisier and suggested a slightly higher copy number in places. A Pearson correlation of 0.8126372 was calculated between the two data sets (Figure 2).
The reproducibility of copy number data produced from NG sequencing was confirmed by demonstrating virtually identical copy number karyograms on analysis of duplicates of four sets of DNA from tumour:normal pairs of fresh frozen lung squamous cell carcinomas (Supplementary Figure S2).
Finally, we compared the copy number karyograms for DNA extracted from snap frozen versus FFPE material from the same lung squamous cell carcinomas (Figure 3). As previously shown by Schweiger et al. (6), the matching fresh and fixed copy number karyograms for some of the samples were almost identical even when examined at high resolution. For other samples, the positions of regions of CNV in the fixed samples were identical to the frozen samples, but the magnitude of variation was greater. For example, in the fresh-frozen sample, LS043, shown in Figure 3, the distal 9 Mb of 9p has a tumour:normal ratio of 1.1:1, while the rest of the 9p arm has a ratio of 0.82:1. In the FFPE sample the two regions are in the same position, but have ratios of 1.51:1 and 0.75:1, respectively. This difference is probably due to the removal of non-cancerous cells (i.e. stromal cells, inflammatory cells and endothelial cells) by macrodissection in the fixed samples.
Due to the difficulty in obtaining large amounts of good quality DNA from many clinical samples, especially fixed archival surgical and biopsy specimens, we undertook to investigate the minimum amount of DNA that is needed to produce acceptable sequence data for copy number analysis. We performed a series of dilutions of DNA from one of the frozen tumour samples and one of the FFPE samples, so that sequencing libraries were generated from 400 ng, 200 ng, 100 ng, 50 ng, 10 ng and 5 ng of starting DNA, compared to Illumina’s recommended 1 µg. As all the libraries appeared to be within normal parameters as judged by Agilent readings, only those prepared from 50 ng, 10 ng and 5 ng were sequenced. While it was difficult to accurately titrate the lower concentration libraries to give a consistent number of sequencing reads, and some samples gave a low percentage of alignable reads, the karyograms produced were almost identical to those made under Illumina’s recommended conditions, suggesting that copy number data could be obtained from nanogram quantities of DNA isolated from tissue sections of an FFPE block (Figure 4). To confirm this in a real sample, we obtained FFPE blocks of sequential biopsies and surgical specimens from a patient with multifocal oral cancer who has been under the care of the local maxillo-facial unit for several years. Specifically, we obtained blocks for (i) a biopsy of a tumour in the left tongue in May 2007, (ii) a biopsy of dysplasia in the right floor of the mouth in June 2008, (iii) a wide excision specimen of the same dysplasia in July 2008 and (iv) four distinct specimens of tumour-associated dysplasia obtained from a further anterior floor of mouth surgical resection in August 2009. DNA was isolated from macrodissected dysplastic and tumour tissue and libraries were prepared from template DNA ranging from 55 to 270 ng.
The karyograms were mostly normal (results not shown): the most notable feature was gain of 10p, which was common to all lesions. The synchronous biopsies obtained in the surgical field in August 2009 carried an 8 Mb region of amplification of distal 9p in addition to gain of 10p (Figure 5).
We investigated sequencing various numbers of pooled libraries together in order to gain an understanding of the quality of data obtainable from different levels of reagent and machine capacity investment. The initial experiment using the LUDLU-1:AGLCL cell line pair was performed using one eighth of an Illumina GAII run and produced 7 897 570 uniquely aligning reads. Of these, tagging indicated that 2 547 384 were from LUDLU-1. A total of 2 500 000 reads of 45 bp (51 bp – 6 bp tag) represents 113 Mb, or ~0.04× coverage. A total of 206 CNV regions were detected, with a mean size of 15 Mb and a smallest detectable size of 15 Kb.
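The coverage figure quoted above follows from simple arithmetic, sketched below. The genome size of 3.1 Gb is an approximation assumed here for illustration; it is not stated in the text.

```python
def sequenced_bases(n_reads, read_length_bp):
    """Total number of genomic bases sequenced."""
    return n_reads * read_length_bp


def coverage(n_reads, read_length_bp, genome_size_bp=3.1e9):
    """Mean fold coverage; the default genome size is an assumed approximation."""
    return sequenced_bases(n_reads, read_length_bp) / genome_size_bp


if __name__ == "__main__":
    reads = 2_500_000
    genomic_length = 51 - 6  # 51 sequencing cycles minus the 6 bp tag
    print(sequenced_bases(reads, genomic_length) / 1e6, "Mb")
    print(round(coverage(reads, genomic_length), 3), "x coverage")
```

Under these assumptions, 2 500 000 reads of 45 bp give 112.5 Mb, which is roughly 0.036× and agrees with the rounded figures of 113 Mb and ~0.04× given in the text.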
For the subsequent experiment using DNA from snap-frozen tumour:normal pairs, the level of multiplexing was increased so that 10 samples (5 pairs) were pooled and analysed together on one eighth of an Illumina GAII run. Between 847 576 and 1 321 486 reads per sample were obtained. Between 57 and 115 copy number variation regions were detected, averaging between 26 and 58 Mb in size, with a smallest detectable region between 0.9 and 1.5 Mb. This experiment was duplicated and the reads from each sequencing run combined to give between 1 873 100 and 2 328 268 reads per sample (0.028–0.035× coverage). The overall CNV pattern of each sample remained unchanged (Figure 6a and b), but the DNAcopy algorithm was able to detect smaller regions of copy number variation, between 89 and 200 Kb. As a result, the number of observed regions increased to between 81 and 187.
To determine how little data would give reproducible results, 90% of the reads from one of the frozen samples (LS010) were randomly removed in silico, leaving 136 405 tumour reads or 0.002× coverage. This is an approximate simulation of running 80 tagged samples on one lane, or 0.15% of the sequencer’s total capacity. Currently, this is not technically feasible, but may be achievable in the near future as methods such as DNA Sudoku mature (20). Unsurprisingly, all fine scale data were lost, the smallest detected region of variation being 5 Mb. However, it was still perfectly possible to see large-scale aberrations such as the gain or loss of whole chromosomal arms (Figure 6c), suggesting that this methodology could still provide useful data even when highly multiplexed.
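Random removal of reads in silico can be sketched as follows. This is an illustrative implementation rather than the script used in the study: the function name, the use of a seeded generator for repeatability and the preservation of the original read order are all assumptions.

```python
import random


def downsample(reads, keep_fraction, seed=None):
    """Keep a random subset of reads, preserving their original order.

    keep_fraction=0.1 corresponds to removing 90% of the reads.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie between 0 and 1")
    reads = list(reads)
    rng = random.Random(seed)
    n_keep = round(len(reads) * keep_fraction)
    # Sample indices without replacement, then restore positional order.
    kept_indices = sorted(rng.sample(range(len(reads)), n_keep))
    return [reads[i] for i in kept_indices]


if __name__ == "__main__":
    toy_reads = list(range(1000))  # stand-ins for aligned read records
    subset = downsample(toy_reads, keep_fraction=0.1, seed=1)
    print(len(subset))
```

Sampling without replacement ensures that every retained read is one that was actually sequenced, so the reduced data set approximates a shallower run of the same library.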
To further explore the theoretical limits of multiplexing, a series of simulations was performed using the LUDLU-1:AGLCL cell line pair. An additional 7.4 million AGLCL reads were sequenced, giving a total of 12 218 030 reads. Reads were randomly stripped away in silico from both samples, resulting in five files of AGLCL reads ranging between 2 441 867 and 12 218 030 and 20 files of LUDLU-1 reads ranging between 127 421 and 2 551 569. 127 421 reads is <1% of the standard output from one lane of an Illumina GAII during our experiments. Every combination of these two samples was then analysed, using LUDLU-1 as the test and AGLCL as the reference sample, and keeping the window sizes equivalent to 200 LUDLU-1 reads. The results are shown in Figure 7. The number of reference reads appeared to have almost no effect on the number of CNVs, their mean size or the smallest detectable CNV. Not surprisingly, the number of LUDLU-1 reads, and hence the size of the windows used, had a much greater effect. Resolution gradually decreased alongside read number, but with a sudden decrease once read numbers dropped below 500 000. It may be that this is the point at which a window size of 200 reads is bigger than many of the actual CNVs from this sample.
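The relationship between read number and window size can be made concrete with a rough calculation. The sketch below assumes a genome of about 3.1 Gb with reads spread evenly across it, which ignores unalignable regions and copy number changes, so the values are order-of-magnitude estimates only.

```python
def mean_window_size_bp(n_reads, reads_per_window=200, genome_size_bp=3.1e9):
    """Approximate mean window width when each window holds reads_per_window reads.

    Assumes uniformly distributed reads and an approximate genome size.
    """
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return genome_size_bp * reads_per_window / n_reads


if __name__ == "__main__":
    # Largest file, the 500 000-read threshold, and the smallest file.
    for n_reads in (2_551_569, 500_000, 127_421):
        print(n_reads, round(mean_window_size_bp(n_reads) / 1e6, 2), "Mb")
```

On these assumptions, a 200-read window spans roughly 0.24 Mb at 2 551 569 reads, about 1.2 Mb at 500 000 reads and nearly 5 Mb at 127 421 reads, which is consistent with the suggestion that below 500 000 reads the windows exceed the size of many CNVs in this sample.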
We have demonstrated that NG sequencing platforms can be used in a high-throughput, cost-effective manner to elucidate copy number information from a variety of DNA sources, including cell lines, frozen tumour samples and FFPE material. We have shown that good quality data can be obtained when multiplexing up to 10 samples on one lane of an Illumina GAII. We have shown that the method is reproducible, and that the resolution is highly flexible and adjustable. The resolution has been shown to be comparable with aCGH when performed at low levels of multiplexing and has a high degree of correlation. NG sequencing also produces much more data than a comparable investment in a PCR-based method such as loss of heterozygosity (LOH) analysis or MCC when performed at higher levels of multiplexing. Sequence data and MCC do show a strong correlation, but not as strong as that between sequencing and aCGH. We have concentrated on tumour samples, but the method is equally applicable to studies of CNV in constitutional DNA. In fact, the data are much easier to interpret when all the cells in a sample have the same genotype.
Currently, the leading technology for investigating CNV is aCGH. However, limitations remain for certain applications. aCGH has proved difficult to use with DNA from FFPE samples. Typically, researchers have had to devise ingenious upstream methods in order to study archival material (21). Also, aCGH typically requires microgram quantities of DNA. When smaller samples are studied, a whole-genome amplification step is generally incorporated (22). Our method required almost no extra fine tuning as we moved from cell line DNA to nanogram amounts of DNA from archival FFPE material, showing that neither a large amount of template nor a whole-genome amplification step is required. In addition, although we have only demonstrated the use of these data for copy number analysis, each lane of sequencing generates in the region of 700 Mb of sequence data, which can be analysed for other purposes such as searching for genetic variants or viral infection.
Previous studies have used read depth of sequence to examine copy number but with ideal reaction conditions and greater depth of coverage (2–5). Other studies have explored the limits of this technology further, by use of multiplexing (23) or by using DNA from FFPE samples (6). We have sought to combine both approaches and to further push the limits as to the minimum amounts of starting DNA required. The multiplexing aspect is important because it allows researchers to tailor their study design according to the number of samples, desired resolution and available resources. The same sequencing library can be used for a 1 Kb or a 10 Mb resolution experiment. The only difference is the amount of sequencing required. Libraries can be aliquoted for use a number of times so that samples can be initially screened cheaply at low resolution and then further examined at high resolution at a later date and with no additional preparation. Data from two duplicate experiments on the same sample can be merged to double the coverage.
We have demonstrated that sequencing libraries can be constructed starting with nanogram quantities of DNA and, by combining individually tagged libraries, copy number karyograms can be generated from 10 samples in a single Illumina sequencing run, with the theoretical possibility of extending this number up to at least 80. Besides demonstrating this high-throughput potential, we have also confirmed the observations of Schweiger et al. (6) that DNA isolated from FFPE material can serve as a template to create sequencing libraries for NG sequencing. We have substantially extended the range of samples that can be studied by reducing the amount of template required for library construction from 1.5 µg to less than 100 ng. This is an important practical consideration because it allows minute tissue samples such as biopsies to be analysed. Not only are biopsies small, but the way they are obtained from patients often incorporates underlying stroma, requiring micro-dissection to increase the proportion of abnormal epithelial cells, decreasing the amount of DNA template still further. However, these difficulties must be overcome, as analysis of biopsies is essential for both basic and translational cancer research.
In attempting to identify the genomic drivers of malignant tumour initiation and progression of upper aerodigestive cancer, we have used a number of molecular genetic techniques to compare candidate regions, genes and chromosomes of pre-invasive lesions, obtained as biopsies, and the subsequent tumours that develop at the site of the earlier pre-invasive lesions (24–26). In this study, we have used NG sequencing to generate whole genome copy number karyotypes of biopsies and tumours obtained in a chronologic series from a patient with oral cancer. The patient’s cancer has had a relatively indolent disease course reflected by several invasive tumours and dysplastic lesions but no metastatic disease. The karyotypes of the tumour and dysplastic lesions obtained carry relatively few gains and losses compared with the other carcinoma karyograms we examined. This may be a reflection of the tumour’s histology, a verrucous squamous cell carcinoma, whose benign growth pattern may in turn be a consequence of a relatively normal genotype. However, all specimens examined carry gain of the short arm of chromosome 10, indicating a clonal origin. Those obtained from the most recent surgery demonstrated genomic progression, having acquired amplification of distal 9p, an 8 Mb region encompassing 63 predicted genes (27). 9p loss has been associated with tumour progression (28,29), but gain in this region so far has not.
Besides the analysis of biopsies for laboratory studies of clonality and the genomics of tumour progression, the genomic assessment of tumours via the preliminary diagnostic biopsy may be a useful supplement to routine histopathology, as knowledge of a tumour’s genotype can provide information for prognosis and treatment response (30). The international initiatives to document cancer genomes (http://www.icgc.org/, http://cancergenome.nih.gov/about/index.asp) will substantially augment our current knowledge. We have used fixed archival specimens in this study, but the current samples of each new patient are also fixed as part of routine clinical practice, and the ability to generate whole-genome information from fixed samples, as demonstrated here, indicates how cataloguing genome architecture could become part of the repertoire of diagnostic tests for cancer.
Supplementary Data are available at NAR Online.
Yorkshire Cancer Research (L341PG); Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme (219540 to O.B.); Cancer Research UK and the Wellcome Trust (to D.J.A.). Funding for open access charge: Yorkshire Cancer Research.
Conflict of interest statement. None declared.
The authors thank Dr Maria Kost-Alimova, Karolinska Institute, Stockholm, Sweden, for kindly providing the HONE-1 cell line used in this study.