Allelic imbalance ratio
TAPS uses the B-allele frequency (BAF), defined as B/(A + B), where A and B are the normalized intensities of the A and B probes, to calculate the allelic imbalance ratio of genomic segments. For each SNP, TAPS takes the absolute value of BAF - 0.5 (the distance to equal A and B signal) and clusters these values on two means, representing heterozygous and homozygous SNPs. The allelic imbalance ratio is obtained by dividing the inner cluster center by the outer. The resulting value is close to zero for a balanced copy number variant (usually about 0.1, because two means are forced and because of noise) and close to one for a very unbalanced copy number variant (such as a high copy number with LOH) combined with very low normal cell content.
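The calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the TAPS R code: the function name `allelic_imbalance_ratio`, the plain two-means iteration and its starting values (the smallest and largest distance) are assumptions made for the example.

```python
from statistics import mean

def allelic_imbalance_ratio(a, b, iterations=50):
    """Illustrative allelic imbalance ratio for the SNPs of one segment.

    a, b: normalized A- and B-probe intensities, one pair per SNP.
    """
    # BAF = B / (A + B); |BAF - 0.5| is the distance to equal A and B signal.
    dist = [abs(bi / (ai + bi) - 0.5) for ai, bi in zip(a, b)]

    # Two-means clustering in one dimension. Starting the centers at the
    # smallest and largest distance is an assumption of this sketch.
    inner, outer = min(dist), max(dist)
    for _ in range(iterations):
        near_inner = [d for d in dist if abs(d - inner) <= abs(d - outer)]
        near_outer = [d for d in dist if abs(d - inner) > abs(d - outer)]
        if not near_outer:
            break  # all SNPs fell into a single cluster
        new_inner, new_outer = mean(near_inner), mean(near_outer)
        if (new_inner, new_outer) == (inner, outer):
            break  # converged
        inner, outer = new_inner, new_outer

    # Inner cluster center divided by outer cluster center.
    return inner / outer if outer > 0 else 0.0
```

With heterozygous SNPs near BAF 0.5 and homozygous SNPs near 0 or 1, the inner center lies near zero and the ratio is small. When the heterozygous SNPs move towards 0 or 1, the two centers approach each other and the ratio approaches one.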
Copy number visualization
For each segment, TAPS considers the mean Log-ratio of all probes and the allelic imbalance ratio of the SNPs. The mean Log-ratio reflects the total copy number of the segment. The allelic imbalance ratio reflects the relationship between the alleles. However, with an unknown average ploidy of tumor cells and an unknown proportion of normal cells, the exact relationship between Log-ratio, allelic imbalance ratio and the allele-specific copy numbers of the tumor cells will vary between samples.
To visualize the tumor aberrations in a sample, Log-ratio is plotted against allelic imbalance ratio for all segments. A high proportion of normal cells reduces both the allelic imbalance caused by imbalanced tumor aberrations and the effect of total copy number changes on Log-ratio. However, segments will still appear in a predictable fashion with respect to one another, and a good assessment can be made with as little as 30% tumor cells (Figure ).
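The dampening effect of normal cells can be illustrated with an idealized mixture model. The formulas below are a textbook simplification and an assumption of this sketch, not something TAPS computes: signal is taken to be proportional to the average number of copies per cell, normal cells carry two copies with one of each allele, and noise is ignored. The function `expected_signal` and its parameters are hypothetical names.

```python
from math import log2

def expected_signal(total_cn, minor_cn, tumor_fraction, mean_ploidy=2.0):
    """Idealized (Log-ratio, allelic imbalance) of one tumor segment.

    tumor_fraction: proportion of tumor cells in the sample (0 to 1).
    mean_ploidy: assumed average copy number of the tumor genome.
    """
    normal_fraction = 1.0 - tumor_fraction
    # Average copies per cell at this locus in the tumor/normal mixture.
    total = tumor_fraction * total_cn + normal_fraction * 2
    minor = tumor_fraction * minor_cn + normal_fraction * 1
    if total <= 0:
        raise ValueError("no DNA at this locus; Log-ratio is undefined")
    # Average copies per cell over the whole genome (normalization level).
    sample_mean = tumor_fraction * mean_ploidy + normal_fraction * 2
    log_ratio = log2(total / sample_mean)
    # Heterozygous SNPs sit at minor/total; homozygous SNPs sit at 0 or 1,
    # so the noise-free imbalance is the heterozygous distance over 0.5.
    imbalance = abs(minor / total - 0.5) / 0.5
    return log_ratio, imbalance

# A single-copy loss (cn1) in a pure tumor and at 30% tumor cells.
pure = expected_signal(1, 0, 1.0)
diluted = expected_signal(1, 0, 0.3)
```

Under these assumptions, both coordinates of the diluted segment move towards those of an unaltered segment, but aberrations keep their order relative to one another.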
Copy number calling
TAPS includes an algorithm for automatic estimation of total and minor copy number. It first estimates the (sample-specific) relationship between Log-ratio, allelic imbalance and copy numbers. This crucial step can be assisted by a visual interpretation of the TAPS scatter plots. The calling algorithm implemented in TAPS then uses the Log-ratio and allelic imbalance ratio of lower copy numbers to estimate the characteristics of higher copy numbers. By iteratively working from lower to higher copy numbers, TAPS continuously adjusts expectations according to observations. TAPS is available from the authors as extensively commented R code. A simplified overview is presented here.
Step 1: estimate the Log-ratio of copy number two, using the Log-ratio and allelic imbalance ratio of the lowest-intensity long autosomal segments. The relatively low allelic imbalance of unaltered regions compared to LOH and single-copy gains and losses is the best indicator of copy number 2.
Step 2: find the allelic imbalance ratio of cn1, cn2m1 (2 with minor copy number 1) and cn2m0 (2 with minor copy number 0, that is, LOH) from all segments belonging to copy numbers 1 and 2.
Step 3: if step 1 or 2 fails, the analyst may supply an initial interpretation from a TAPS scatter plot.
Step 4: for each successive higher copy number, use the difference in Log-ratio between lower copy numbers to estimate its Log-ratio. Set it to the median of any segments that match the expectation well (note that segments are weighted on their length). If no such segments exist, set it to the expectation. The Log-ratio difference between successively higher copy numbers tends to drop slowly but steadily, and this way TAPS adjusts its expectations according to observations in the current sample.
Step 5: at copy number 3 and higher, use the differences in allelic imbalance ratio seen on lower copy number variants (such as cn1, cn2m1 and cn2m0 for copy number 3) to predict the allelic imbalance ratio of copy number variants (such as cn3m1 and cn3m0). Set them to the median of any segments of the correct copy number that closely match the expectation (note that segments are weighted on their length). If no such segments exist, set it to the expectation. This step uses the tendency of copy number variants with the same minor copy number to line up diagonally (with a slowly decreasing slope), which can be seen in the TAPS scatter plots.
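The core of step 4 can be sketched as follows. This is an illustration under stated assumptions, not the TAPS implementation: the expectation is formed by repeating the last observed Log-ratio difference unchanged, the `tolerance` that decides whether a segment "matches the expectation well" is a hypothetical parameter, and the function names are invented for the example.

```python
def weighted_median(values, weights):
    """Median in which each value counts in proportion to its weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("no values given")

def next_log_ratio(known, segments, tolerance=0.05):
    """Estimate the Log-ratio of the next higher copy number.

    known: Log-ratios already assigned to consecutive copy numbers,
           lowest first (at least two entries).
    segments: (mean_log_ratio, length) pairs for the sample.
    tolerance: hypothetical cutoff for a good match to the expectation.
    """
    # Expectation: extend the last observed difference by one step.
    expected = known[-1] + (known[-1] - known[-2])
    matches = [(lr, n) for lr, n in segments if abs(lr - expected) <= tolerance]
    if not matches:
        return expected  # nothing observed: fall back to the expectation
    # Otherwise use the length-weighted median of the matching segments.
    return weighted_median([lr for lr, _ in matches], [n for _, n in matches])

# Example with made-up values: cn1 at -0.5, cn2 at 0.0, estimate cn3.
cn3 = next_log_ratio([-0.5, 0.0], [(0.46, 100), (0.48, 300), (0.52, 50), (0.0, 1000)])
```

Appending each estimate to `known` and calling the function again moves from lower to higher copy numbers, so that an observed shrinking of the differences carries over into later expectations. Step 5 follows the same pattern with allelic imbalance ratios in place of Log-ratios.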
Sample preparation and microarray experiments
Twelve colon cancer samples were selected from a set of immediately frozen tumor biopsies from patients who underwent surgery for colorectal cancer at the hospitals in Uppsala or Västerås, Sweden. Two of the 12 had appeared to fit a conventional copy number analysis well, while the remaining 10 had raised suspicions of hyperploidy. All patients gave informed consent, in accordance with the research ethical committee at Uppsala University, for the storage, isolation of DNA and use of the material in research projects. The tumor cell content in each sample was at least 50%, based upon examination of a hematoxylin-eosin stained section by a pathologist (JB or PM). All patients had stage II or III colon cancer. All samples were fully anonymized. DNA was extracted from two to ten frozen tissue sections (10 μm) using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany). DNA concentrations were measured with an ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).
Lung cancer cell line H1395 and patient-matched blood cell line BL1395 were obtained from the American Type Culture Collection (ATCC) and cultured according to their recommendations. DNA extraction was performed using the DNeasy Tissue Kit (Qiagen). Dilutions representing 30, 50 and 70% tumor cell content were prepared from the extracted DNA. The higher (near-triploid) DNA content of the tumor cells was compensated for by using 42, 65 and 80% tumor DNA.
Array experiments were performed according to the standard protocols for Affymetrix Genome-Wide Human SNP Array 6.0 arrays (Cytogenetics Copy Number Assay User Guide, P/N 702607 Rev2; Affymetrix Inc., Santa Clara, CA, USA). Quality control was performed in Affymetrix Genotyping Console version 3.0. Array data, including the cell line and colorectal tumor copy numbers, are available at the GEO [GEO:GSE26302].
Data preparation and analysis
Raw data (.CEL files) from colorectal cancer samples were normalized, Log-ratio and allele frequency were extracted, and segmentation was performed in BioDiscovery Nexus Copy Number 3.0, with European HapMap samples as a reference set and using the Rank Segmentation algorithm, which is based on CBS. Downstream analysis was performed in R using the TAPS suite, including allelic imbalance ratio calculation, plotting and copy number calling.
Published lung cancer cell line raw data (Affymetrix GeneChip Human Mapping 250K) were processed in BioDiscovery Nexus Copy Number 3.0 with European HapMap samples as a reference set and using the Rank Segmentation algorithm. Downstream analysis of Log-ratio, allele frequency and segments was performed in R using the TAPS suite. The average copy number of each sample was read from the TAPS scatter plots. SKY karyotypes from samples H2122, H2126, H1395, H1437, H1770, H2087 and H2009 were downloaded and used to verify the result of TAPS [18]. Summaries of the analysis are available in Additional file 2.
Copy number analysis with PICNIC, GAP, PSCN and TAPS was performed on SNP6 raw data from the three diluted samples (30, 50 and 70% tumor cells) and the pure H1395 cell line. We selected all aberrations for which the allele-specific copy number calls of at least three of the four methods coincided in the pure tumor sample. These were 35 large regions, representative of all types of copy number aberrations in the sample, which together covered the majority of the genome. We then observed whether each of the four methods gave matching copy number calls for each region in the normal cell-diluted samples. Sensitivity was calculated as the percentage of the 35 aberrations that were mostly correct (correct allele-specific copy number in more than half of the region). Since the methods use different segmentation strategies, exact breakpoints were not considered important. Specificity was measured by first defining the truly unaltered genome, using the pure tumor sample and concurring (heterozygous copy number 2) calls of at least three methods. For each method and diluted sample, we then calculated the percentage of the truly unaltered genome (True negatives + False positives) that was reported as such (True negatives), applying the general definition of specificity as True negatives/(True negatives + False positives). Automatic copy number analysis was used with all methods.
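The two measures defined above can be written out as a short sketch. The function names and input layout are assumptions of this example, and the numbers in the usage lines are invented to show the arithmetic; they are not results from the study.

```python
def sensitivity(correct_fraction_per_region):
    """Percentage of reference regions that were called mostly correctly.

    correct_fraction_per_region: for each reference aberration, the
    fraction of the region that received the correct allele-specific
    copy number. A region counts as correct above one half.
    """
    hits = sum(1 for f in correct_fraction_per_region if f > 0.5)
    return 100.0 * hits / len(correct_fraction_per_region)

def specificity(true_negatives, false_positives):
    """Percentage of the truly unaltered genome reported as unaltered.

    Both arguments are amounts of genome (for example, base pairs):
    true_negatives was called unaltered, false_positives was not.
    """
    return 100.0 * true_negatives / (true_negatives + false_positives)

# Invented example: 28 of 35 regions mostly correct, and 900 of 1000
# units of truly unaltered genome reported as unaltered.
example_sensitivity = sensitivity([1.0] * 28 + [0.4] * 7)
example_specificity = specificity(900, 100)
```

Counting a region as correct only when more than half of it carries the right call keeps the measure insensitive to the exact breakpoints chosen by each method.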
DNA ploidy analysis
Formalin-fixed, paraffin-embedded tissues corresponding to four of the colorectal cancer tissue samples were deparaffinized and analyzed for DNA content as previously described [22