Sample collection and preparation
Plasma samples were collected from cancer patients CRC11 to CRC17, BR1 to BR3 (4 to 17 ml each), and unrelated normal controls N1 to N10 (5 to 18 ml each). Matching tumor DNA samples were obtained from patients’ formalin-fixed paraffin-embedded (FFPE) surgically resected tumor. Normal DNA samples were obtained from either matched lymphocytes or matched normal FFPE tissue obtained at the time of surgery. Whole-genome sequence data from normal lymphocytes of 28 individuals with neuroblastoma or pancreatic cancer were obtained from previous studies (33
). Genotype and focal amplification data from 36 colorectal and 45 breast cancer samples, representing early-passage cell lines or xenografts established from patients with late-stage disease, were obtained from previous studies (25
). All samples were obtained in accordance with the Health Insurance Portability and Accountability Act.
Preparation of Illumina fragment sequencing libraries
Plasma DNA libraries were prepared following Illumina’s suggested protocol with the following modifications: (i) circulating DNA isolated from plasma was mixed with 10 μl of End Repair Reaction Buffer and 5 μl of End Repair Enzyme Mix (catalog no. E6050, New England Biolabs) in a final volume of 100 μl. The end-repair mixture was incubated at 20°C for 30 min, purified by a PCR purification kit (catalog no. 28104, Qiagen), and eluted with 45 μl of elution buffer (EB) prewarmed to 70°C. If starting DNA volumes were >85 μl, multiple 100 μl end-repair reactions were used per sample and purified over the same Qiagen column as indicated above. (ii) To A-tail, all 42 μl of end-repaired DNA was mixed with 5 μl of 10× dA Tailing Reaction Buffer and 3 μl of Klenow (exo-) (catalog no. E6053, New England Biolabs). The 50-μl mixture was incubated at 37°C for 30 min before DNA was purified with a MinElute PCR purification kit (catalog no. 28004, Qiagen). Purified DNA was eluted with 27 μl of EB. (iii) For adaptor ligation, 25 μl of A-tailed DNA was mixed with 10 μl of PE-adaptor (Illumina), 10 μl of 5× Ligation buffer, and 5 μl of Quick T4 DNA ligase (catalog no. E6056, New England Biolabs). The ligation mixture was incubated at 20°C for 15 min. (iv) To purify adaptor-ligated DNA, 50 μl of ligation mixture from step (iv) was mixed with 200 μl of NT buffer and cleaned up by NucleoSpin column (catalog no. 636972, Clontech). DNA was eluted in 50 μl of EB. (v) To obtain an amplified library, 10 PCRs of 25 μl each were set up, each including 12 μl of H2O, 5 μl of 5× Phusion HF buffer, 0.5 μl of a deoxynucleotide triphosphate (dNTP) mix containing 10 mM of each dNTP, 1.25 μl of dimethyl sulfoxide (DMSO), 0.5 μl of Illumina PE primer 1, 0.5 μl of Illumina PE primer 2, 0.25 μl of Hot Start Phusion polymerase (M0530L, New England Biolabs), and 5 μl of the DNA from step (v). The following PCR program was used: 98°C for 1 min; 10 cycles of 98°C for 20 s, 65°C for 30 s, 72°C for 30 s; and 72°C for 5 min. The pooled 250-μl PCR was purified by two sequential AMPure XP bead (catalog no. A63881, Beckman Genomics) purifications with a 1:1 PCR product/AMPure bead mix. Library DNA was eluted with 40 μl of EB, and the DNA concentration and library quality were determined with both NanoDrop and BioAnalyzer.
Genomic DNA libraries from FFPE tissue were prepared as described above with the following modifications: (i) Before end repair, 0.5 to 1 μg of genomic tumor DNA in 100 μl of TE was fragmented in a Covaris sonicator to a size of 100 to 500 bp. To remove fragments shorter than 150 bp, we mixed DNA with 25 μl of 5× Phusion HF buffer, 416 μl of ddH2O, and 84 μl of NT binding buffer and loaded it into NucleoSpin column. The column was centrifuged at 14,000g in a desktop centrifuge for 1 min, washed once with 600 μl of wash buffer NT3 (Clontech), and centrifuged again for 2 min to dry completely. DNA was eluted in 45 μl of EB included in the kit. (ii) The PCR program used was as follows: 98°C for 1 min; 10 cycles of 98°C for 20 s, 65°C for 30 s, 72°C for 30 s; and 72°C for 5 min.
Sequencing and analyses of Illumina fragment libraries
All samples were sequenced on Illumina HiSeq and Genome Analyzer II instruments. Sequence reads were analyzed and aligned to human genome hg18 with the ELAND algorithm using CASAVA 1.7 software (Illumina). Reads were mapped with the default seed-and-extend algorithm, allowing a maximum of two mismatched bases in the first 32 bp of sequence. Paired reads with duplicate start positions or reads with mapping scores of <50 were removed from further analysis.
Identification of somatic rearrangements
Somatic rearrangements were identified by querying aberrantly mapping paired-end sequences identified through ELAND. The discordantly mapping pairs were grouped into 1-kb bins and further considered when more than two distinct read pairs (with distinct start sites) spanned the same two 1-kb bins and suggested either an inter- or an intrachromosomal rearrangement ≥30 kb in size. To identify all high-confidence genomic rearrangements, we required candidate rearrangements to have at least one read sequenced across the rearrangement breakpoint. Breakpoints were identified at base-pair resolution with BLAT (34
) to compare the putative rearrangement to the human genome sequence (March 2006, NCBI36/hg18) with zero mismatches and <5 bp overlapping between the two rearranged loci. To reduce the number of false-positive somatic rearrangements identified, we applied a final set of filters. Candidate somatic rearrangements were required to (i) not contain sequences that are entirely repeat-masked in the 200 bp surrounding the breakpoint, (ii) contain paired-end reads spanning the breakpoint where >66% of the aberrantly mapped paired reads for each loci were involved in the candidate rearrangement, (iii) uniquely map to a region of the reference genome using a 36-bp sequence (two or less mismatches), and (iv) occur in a region not implicated in known human germline variation (35
These first three criteria were designed to eliminate those rearrangements identified by alignment artifacts to the reference genome, whereas the fourth criteria removed common germline polymorphisms. Candidate rearrangements were independently confirmed to be somatic when a 10-μl PCR-based reaction [containing 5.9 μl of H2O, 1 μl of 10× PCR buffer, 1 μl of 10 mM dNTPs, 0.6 μl of DMSO, 0.4 μl of 25 μM primers, 0.1 μl of Platinum Taq, and 1 μl of DNA (3 ng/μl)] resulted in the amplification of a product of the expected size using tumor DNA but not DNA from the matched normal tissue or lymphocytes.
Detection of copy number alterations
For each sample, distinct reads with mapping scores of ≥30 were mapped uniquely to chromosome arms. The total number of reads (one read was analyzed per cluster generated on the flow cell) mapping to each chromosome arm was compared to the distinct reads mapping to all 39 autosomal chromosome arms (only one chromosome arm from each of five acrocentric chromosomes was evaluated). Given the known GC bias of both next-generation sequencing library preparation and next-generation sequencing, we corrected for GC content across the genome (23
). The average number of distinct reads mapping to each 5-kb region of GC content (in 0.1% GC intervals) was determined. A weight was calculated for each GC interval equal to the global average number of distinct reads mapping to any interval of GC content divided by a specific interval of GC content (excluding intervals with more or less than 1.96 SDs of the global average number of distinct reads). After regions of known copy number polymorphism were removed (35
), the remaining GC-corrected reads were summed for each chromosome arm. The number of reads mapping to a given chromosome arm was compared to the total number of reads mapping to all 39 chromosome arms to calculate the percentage of genomic representation (GR).
To determine whether the observed percentage of reads mapping to a given chromosome arm in the plasma derived from patients with cancer was different from the observed percentage of reads mapping to a given chromosome arm in the plasma derived from normal patients, we calculated the z
A cutoff of ≥11.88 or ≤11.88 was applied (P
< 0.05 after Bonferroni correction for 39 chromosome arms, Student’s t
test with three degrees of freedom) to indicate either a gain or a loss of the chromosome arm.
To combine the significance of multiple chromosome arm alterations in plasma cancer samples, we used the five chromosomes containing the arms with the largest absolute z
scores in each case to calculate a PA score. The top five z
scores for each case were converted to P
values (Student’s t
distribution, three degrees of freedom), and the negative of the sum of the logarithms of the P
values was calculated for each tumor and normal case. The PA score was calculated as follows:
PA scores higher than the threshold score of 5.84 provide a specificity greater than 99% (Student t
distribution, three degrees of freedom) for the presence of aneuploid circulating tumor DNA.
Simulated sensitivity of PARE approaches
Estimates of the number of sequence reads needed for detection of circulating cell-free tumor DNA were determined for PARE with paired sequence reads of 100 bp in length and the following assumptions:
- Each tumor was conservatively assumed to have at least two separate structural alterations per genome. This is consistent with previous reports of most human cancer samples analyzed to date using whole-genome sequencing having at least these many alterations per genome (36).
- Circulating cell-free DNA was assumed to be a mixture derived from normal and tumor DNA.
- Rearrangements were assumed to be either heterozygous or amplified at 50 copies per nucleus. These conditions illustrate the boundaries of rearrangement copy number changes in tumor samples and show the increased sensitivity of detecting such alterations in tumors with focal amplifications. The upper level of 50 copies per nucleus is based on reported examples of EGFR and MYCN amplifications in human cancers (37, 38).
- The sequence data obtained were sufficient to allow sequencing at least one of the rearrangement breakpoints at least twice to pass bioinformatic filters.
- Rearrangements that passed our bioinformatic filters were assumed to only be present in tumor cells.
Using this approach, the minimum fraction of circulating tumor DNA detectable was estimated for given levels of sequencing requiring a detection sensitivity of 90% for detection of at least one rearrangement.
Simulated sensitivity of DK
Estimates of the number of sequence reads needed for detection of circulating cell-free tumor DNA were determined for DK with paired sequence reads of 100 bp in length and the following assumptions:
- The simulated next-generation sequence data for 10,000 simulated normal cases obeyed a normal distribution centered on the observed experimental means and SD for each chromosome arm proportion observed in the 10 experimental normal cases analyzed (N1 to N10).
- The simulated next-generation sequence data for 36 colorectal cancer cases and 45 breast cancer cases obeyed a normal distribution with SD for each chromosome arm proportion observed in the 10 experimental normal cases analyzed (N1 to N10).
- The karyotypes of SNP data previously obtained for 36 colorectal cancers and 45 breast cancers (25) were inferred with SNP allele frequencies (39). Copy number was inferred from the genotypes of alleles A and B with the following criteria: B allele frequency of 0.1 to 0.3 (tetraploid, AAAB genotype), 0.3 to 0.4 (triploid, AAB genotype), 0.4 to 0.6 (diploid, AB genotype), 0.6 to 0.7 (triploid BBA), and 0.7 to 0.9 (tetraploid BBBA). The above criteria were used to determine copy number of each chromosome arm based on informative SNPs. If multiple genotypes were present on an arm, the most frequently observed genotype was used to determine copy number. If >60% of informative SNPs did not meet the above criteria, the arm was considered to be present at a single copy. At least five chromosome arm alterations were observed in every sample analyzed in contrast to other copy number alterations (for example, focal amplifications more than fivefold and whole chromosome changes), which were observed in only a fraction of cases.
- As the amount of sequencing increased, the SD decreased proportionately to the square root of the number of reads obtained, similar to that expected from previous analyses of circulating fetal DNA (8, 9).
To estimate the sensitivity and specificity of detecting copy number alterations in the plasma using one HiSeq lane of sequencing, we calculated the chromosome arm proportions in R (version 2.3.1) (40
), adjusted for the aneuploid karyotype by multiplying the simulated genomic proportion for each arm by the number of copies of the respective arm divided by two for each of the 81 human cancers based on the karyotypes indicated above in (iii) and then converted to z
scores. PA scores were calculated from z
scores as described above with data from 10,000 simulated unaffected cases as normal controls. ROC analyses were performed with the PA scores at different cutoffs for a given tumor fraction to assess sensitivity and specificity. Similar analyses were performed at sequencing levels higher and lower than one Illumina HiSeq lane with the observed mean proportions and fold-adjusted SD. In each case, the theoretical SD (calculated as the square root of the ratio comprising the genomic proportion multiplied by one minus the genomic proportion divided by the total tags sequenced) was based on the amount of simulated sequence data and multiplied by the fold difference between the observed and theoretical SDs from the observed data of unaffected individuals (N1 to N10) obtained with one Illumina HiSeq lane (table S2
). The minimum fraction of circulating tumor DNA detected for a given level of sequencing was then interpolated with the observed mean and SD for each chromosome arm at 90% sensitivity and 99% specificity.
Quantification of tumor burden in plasma
Digital PCR with PARE biomarker–specific primers was used to quantify circulating tumor DNA in patient plasma as described (6
). Briefly, circulating tumor DNA was PCR-amplified with patient rearrangement-specific primers (at a final concentration of 0.5 μM each) and 2× Phusion Flash PCR Master Mix. Control primers to the unrearranged allele of each locus were used to establish the amount of unrearranged DNA in each sample. The total DNA amount in each sample was considered to be the sum of rearranged and unrearranged DNA. Quantification through copy number analysis of chromosome arms was performed as previously described (24
) for cases CRC13 and CRC16, where the average tumor fraction of chromosome arms with absolute z
scores of ≥11.88 were used to determine circulating tumor burden. A Pearson correlation between the tumor burden in plasma samples and PA scores was calculated for all cases, where circulating tumor DNA levels could be independently measured with digital PCR with PARE biomarker–specific primers.