Proof of concept of the sequencing-based DNA copy number profiling assay
To develop and evaluate a sequencing based platform for assaying CNV, we prepared genomic DNA libraries from a pool of DNA derived from blood from non-diseased males; a pool of DNA derived from blood from non-diseased females; and DNA from the UMC-11 cell line, a lung carcinoid-derived cell line. DNA was fragmented using DNase. We did not test whether DNase can shear heterochromatin sequence such as in centromeres; however, we used only sequence reads that align uniquely to the genome and thus would not use sequence reads from regions of low sequence-complexity. The ends of the fragmented DNA were filled in or cleaved to produce blunt ends. This blunted DNA was ligated to adapters containing reverse complements on like adapters for suppression PCR as well as priming sites for a set of universal PCR primers and the Illumina sequencing primer. These "DNA Libraries" were then amplified using standard PCR conditions. We did not use size-selection after fragmentation, allowing lab-automation of library construction.
We sequenced libraries in the Illumina Genome Analyzer II sequencer, generating 6,744,152 (male pool), 3,283,370 (female pool), and 5,204,934 (UMC-11) reads (Table ) (NCBI GEO accession GSE21159) Each 36 nt read contained a 3 nt molecular barcode for potential sample multiplexing which was trimmed before alignment.
| Table 2Sequencing reads (33 nt) and alignments. |
Genomic alignment
The resultant sequence reads were computationally aligned to the genome using the algorithm BWA [
20]. We used only reads aligning to the genome with the highest (least ambiguous) score, minimizing incorrect mappings such as those from the female pool mapping to the Y chromosome. A shortcoming of the use of uniquely aligning reads is that reads aligning to sequence-identical segmental duplications within the HG18 genome will be discarded and these regions not monitored. Between 92% and 93% of the reads aligned to the genome, with between 63% and 65% aligning unambiguously (Table ). For each unambiguously aligned read, we recorded the chromosome and 5' coordinate.
We computationally tested use of different read lengths. Shorter reads are more likely to align ambiguously whereas longer reads are more likely to align to a unique genomic location. We computationally generated all oligos (tiles) from the human genome at lengths from 20 nt to 200 nt. We aligned all to the genome and, for each tile length, determined the fraction of the reads that aligned to only one location (Figure ). Seven percent of the human genome remains undetermined, containing repeat and low-complexity sequence patterns, e.g. centromeres.
Only 26% of the 20 nt tiles align uniquely, increasing to 65% at 25 nt, and to 70% at 30 nt, and smoothly increasing to 90% at 200 nt. A fundamental shift occurs between 20 and 25 nt: genome uniqueness changes from majority ambiguous to majority unambiguous. The 33 nt length (36 nt minus the 3 nt barcode) used in the study here represents an effective trade-off between sequencing costs (time and money) and read uniqueness.
Counting & uncertainty
We selected biological elements for examination, such as DNA windows. For each window, we counted the number of reads aligning within the window. We compared the number to that found in a diploid reference sample. We used the DNA pooled from non-diseased males as a reference as the DNA should be diploid across autosomes and represent all chromosomes, including chromosome Y.
We defined an error model for a measurement "
x" as a Poisson term from counting statistics:
With this error model, a block with 100 reads would have an uncertainty of ± 10. This error model reflects sampling uncertainty but neither biological or operator variability.
We used a uniform window size across the genome; however, a variable window size could be selected. The selected window size should reflect the total number of uniquely aligned reads available, the desired sensitivity/significance, and the sample cell-to-cell genomic heterogeneity. Here, we assumed that the cellular DNA was homogeneous and wished to have the power to distinguish copy number 3 from 2 at a p-value better than 0.001 assuming the described error model. This equates to 110 reads per genomic block. As we had 3.3 million reads from the female blood pool, of which 2.1 million uniquely aligned to the genome, and given the human genome of 3.1 billion base pairs, this results in a block size of 164 kb. We rounded to 150 kb blocks.
Normalization
We normalized reads to account for the differing numbers of reads per sample and to determine absolute copy number. We found that the relative number of reads mapping to a DNA block depended on the molecular biology protocol used, thus necessitating use of a empirical normalization rather than simply a measure such as GC content. We used a male pool as a reference, and with this normalization were able to derive the copy number, upper and lower bounds, and the significance of any variation.
We counted the reads mapping to each 150 kb block from the male and female pools and generated a ratio, female-to-male, for each block. There were roughly twice the number of reads for the male sample (Table ), and thus the distribution of ratios peaks near 0.5 (Figure , top left). Normalizing by the number of aligned reads from the female and male samples results in a distribution peaked near one, as expected for samples with similar ploidy, and copy number two, assuming the male sample pool is diploid across autosomes (Figure , lower left).
The distribution of UMC-11 to male ratios after normalization by the number of reads shows multiple peaks, none of which are centered at an integer (Figure , middle right). We tried using the ratios to determine the normalization: using the ratio mean or the median failed to center a distribution peak at an integer. However, we were able to normalize the counts based on the mode of the ratio distribution across autosomes. After normalization by the ratio model, we find that the three peaks in the ratio distribution from the UMC-11 cell line are regions with one, two, and three copy number (Figure , lower right).
Comparisons between samples
To compare CNV between samples, we calculated the significance of a difference using a t-test and p-value based on the normalized counts and normalized uncertainties. For each DNA element, the assay thus returns the absolute copy number, uncertainty, upper and lower bounds, and a p-value representing the significance of a measured difference.
Validation with qPCR
We used qPCR (Taqman) to assay DNA copy number in UMC-11 cells. TaqMan primer-probe reagents were obtained through the Applied Biosystems Assays-by-Design custom assay service (Foster City, CA) and designed to fall outside of repeat regions.