Structural variations play an important role in diseases such as cancer. Although many tools for detecting copy number variations have been described, improved methods are still needed for detecting acquired structural variants in tumor (or diseased) samples. This report describes an approach, COPS, to discover SCNAs, i.e. sample-specific regions of copy number alteration between paired (normal and tumor) samples. We used trial data and comparisons with other methods to judge the performance of COPS. COPS fine-tunes common, if not universal, read-depth-based approaches (binning, ratios, smoothing, etc.) and incorporates an internal, robust and non-heuristic statistical method to judge the probability that a called CNA is truly deviant. Despite its simplicity, the method and the results obtained should be useful to the biology community where a simple approach to finding pair-wise, tumor (disease)-specific copy number alterations is desired, one that can run on a desktop computer and requires very little knowledge of sophisticated bioinformatics tools.
Most CNV detection tools are optimized to perform well for a single sample (and not paired samples) and for a particular size range of amplifications and deletions, and since different complementary approaches discover ~30–60% of CNVs, the results obtained using these tools cannot be compared directly with each other. On the other hand, COPS performed well over a wide size range of SCNAs and for different read lengths. Sequencing errors and experimental anomalies introduced during imaging and/or sequencing library preparation do not bias our analysis, as both test and ref samples are equally subject to them; the alterations that remain are therefore real tumor (disease)-specific alterations. COPS scales up well in detecting larger SCNAs (>10 kb), in terms of sensitivity, specificity and size deviation. Its improved performance relative to other tools at the higher size range works to its advantage in detecting cancer-specific SCNAs. We tested COPS repeatedly on simulated and real data and found its results to be reproducible for a given dataset, as expected for a non-heuristic approach. Some CNV detection tools, like RDXplorer, filter out reads of low mapping quality (<Q30). Such a filter is not necessary in a pair-wise approach like COPS. Another pair-wise CNA detection tool, CNASeg, also uses depth-of-coverage information to calculate CNAs in tumor samples. However, we could not include CNASeg in our performance comparisons because no compatible (working) version of the software was available for our computing environment (personal communication with Sergei Ivakhno). Post-processing errors in filtering false positives and merging are lower when the paired log2 ratios are significantly different from 0, which makes COPS perform well in detecting larger CNAs. CNV-Seq and SVDetect use paired log2 ratios to calculate CNAs but performed poorly in our comparative study. This is most likely because they lack pre- or post-processing steps, such as assigning values to undefined log2 ratios (caused by a lack of reads in the test sample, the ref sample, or both) based on their neighboring bins, smoothing the data, filtering false positives and merging SCNAs.
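The pre- and post-processing steps named above (filling undefined log2 ratios from neighboring bins, smoothing, and merging) can be sketched as follows. This is a minimal illustration, not the COPS implementation: the function names, the nearest-neighbor mean, the running-median window of 3 bins and the ±0.5 log2 threshold are all assumptions chosen for the example.

```python
import math
from statistics import median

def log2_ratios(test_depth, ref_depth):
    """Per-bin log2(test/ref); None where either depth is zero (ratio undefined)."""
    return [math.log2(t / r) if t > 0 and r > 0 else None
            for t, r in zip(test_depth, ref_depth)]

def fill_from_neighbors(ratios):
    """Replace each undefined ratio with the mean of its nearest defined neighbors.
    (Assumed rule; falls back to 0.0 if no bin is defined at all.)"""
    filled = list(ratios)
    for i, value in enumerate(ratios):
        if value is not None:
            continue
        left = next((ratios[j] for j in range(i - 1, -1, -1)
                     if ratios[j] is not None), None)
        right = next((ratios[j] for j in range(i + 1, len(ratios))
                      if ratios[j] is not None), None)
        neighbors = [v for v in (left, right) if v is not None]
        filled[i] = sum(neighbors) / len(neighbors) if neighbors else 0.0
    return filled

def smooth(ratios, window=3):
    """Running median over a centered window that shrinks at the edges."""
    half = window // 2
    return [median(ratios[max(0, i - half): i + half + 1])
            for i in range(len(ratios))]

def merge_calls(ratios, threshold=0.5):
    """Merge adjacent bins with the same call into (first_bin, last_bin, state)."""
    segments = []
    for i, r in enumerate(ratios):
        state = "gain" if r >= threshold else "loss" if r <= -threshold else None
        if state is None:
            continue
        if segments and segments[-1][2] == state and segments[-1][1] == i - 1:
            segments[-1] = (segments[-1][0], i, state)
        else:
            segments.append((i, i, state))
    return segments

# Toy binned depths: bin 3 has no test reads inside an otherwise gained region.
test = [10, 10, 20, 0, 20, 20, 10, 10]
ref = [10, 10, 10, 10, 10, 10, 10, 10]
segments = merge_calls(smooth(fill_from_neighbors(log2_ratios(test, ref))))
```

In this toy case the empty bin inherits the ratio of its neighbors, so the gain is reported as one segment rather than being split in two, which is the kind of fragmentation that tools without these steps would be prone to.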
Additionally, bin size is one of the important factors in determining the accuracy of SCNA identification and varies according to read length, sequencing coverage (Table S1) and data quality. However, since our approach is based on depth of coverage at each nucleotide position, we used a fixed bin size, which renders its performance invariant across read lengths. Current tools for CNV detection do not detect all true positive CNAs across the genome for a wide range of read lengths. Abyzov et al. discuss the need for alternative approaches for detecting CNVs with sequencing data of longer read lengths. However, we find that the performance of COPS scales up for reads of up to 150 base pairs for most CNA size ranges, partially corroborating the finding of Abyzov et al.
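To illustrate why per-nucleotide depth with a fixed bin size is less sensitive to read length than a read count, consider the toy sketch below. The read sets are idealized end-to-end tilings, and the helper names (`per_base_depth`, `binned_mean_depth`, `tile`) are hypothetical; none of this is taken from the COPS code.

```python
def per_base_depth(reads, region_length):
    """Depth at each position from (start, length) reads, via a difference array."""
    diff = [0] * (region_length + 1)
    for start, length in reads:
        diff[start] += 1
        diff[min(start + length, region_length)] -= 1
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth

def binned_mean_depth(depth, bin_size):
    """Mean per-base depth in consecutive fixed-size bins."""
    bins = []
    for i in range(0, len(depth), bin_size):
        chunk = depth[i:i + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins

def tile(read_length, region_length, coverage):
    """Toy read set: reads tiled end to end, each tile repeated `coverage` times."""
    return [(start, read_length)
            for start in range(0, region_length, read_length)
            for _ in range(coverage)]

# Same 4X coverage of a 300 bp region, with 50 bp and 150 bp reads.
short_reads = tile(read_length=50, region_length=300, coverage=4)
long_reads = tile(read_length=150, region_length=300, coverage=4)

short_bins = binned_mean_depth(per_base_depth(short_reads, 300), bin_size=100)
long_bins = binned_mean_depth(per_base_depth(long_reads, 300), bin_size=100)
```

Both read sets give the same binned depth at the same bin size, while the number of reads differs threefold (24 versus 8). A count-based statistic would therefore need its bin size retuned for each read length, whereas a depth-based one would not in this idealized setting.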
Alignment of raw sequence reads to a reference genome is the first step in NGS data analysis. Read lengths, sequencing errors, repeat regions of the genome and the presence of SNPs and/or indels affect the efficiency of alignment of reads to the reference genome. Data from our lab have shown that post-alignment base calibration, and not the alignment per se, has a large impact on finding true positive single nucleotide variants from sequencing data and increases the sensitivity of variant detection. Although the effect was minimal, it was not surprising that some of the most sensitive aligners performed marginally better than others when tested with COPS for the detection of CNAs. COPS does not contain any module for correcting GC bias introduced during sequencing. In an approach based on inter-sample ratios, we believe GC correction is not necessary, because the bias within a bin is inherently corrected for when the ratio is calculated. COPS, being a paired ratio-based approach, allows analysis of reads mapping to repeated gene clusters and segmental duplications such as the beta-defensin gene. Tumor heterogeneity is a major issue that may complicate downstream sequencing analysis of cancer samples. The International Cancer Genome Consortium requires researchers to use samples with at least 80% tumor cells on histological assessment and less than 20% necrotic/normal cells. Presently, most researchers focusing on cancer genome sequencing use samples with a very high proportion of tumor cells. However, in order to cover a wide variety of cancer samples, both sequencing technologies and analytical tools need to be developed that can take a high degree of cellular heterogeneity into account. COPS is not designed for samples that have a high degree of heterogeneity and assumes a very high percentage of tumor cells in the samples. Additionally, as COPS relies on a paired approach, it assumes uniform sequencing coverage for both the ref and test samples. If the samples are sequenced at different read coverage, the ratio of the coverages can be factored in to determine what read-depth ratio should be treated as the neutral baseline.
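Two of the points above lend themselves to a short numerical sketch: shifting the neutral baseline when the two samples are sequenced at different coverage, and the cancellation of a within-bin bias in an inter-sample ratio. The function names and numbers are illustrative assumptions. The GC example further assumes that the bias acts as the same multiplicative factor on both samples in a given bin, which is the premise of the argument rather than something the code verifies.

```python
import math

def baseline_log2(test_coverage, ref_coverage):
    """log2 ratio expected for a copy-neutral bin, given genome-wide coverages."""
    return math.log2(test_coverage / ref_coverage)

def centered_log2_ratio(test_depth, ref_depth, test_coverage, ref_coverage):
    """Per-bin log2 ratio re-centered so that copy-neutral bins sit at 0."""
    raw = math.log2(test_depth / ref_depth)
    return raw - baseline_log2(test_coverage, ref_coverage)

# Test sample sequenced at 40X, ref at 20X: a raw ratio of 2 is the neutral baseline.
neutral = centered_log2_ratio(60, 30, test_coverage=40, ref_coverage=20)
gain = centered_log2_ratio(90, 30, test_coverage=40, ref_coverage=20)

# A hypothetical GC-related bias that halves the depth in this bin for BOTH samples
# cancels in the ratio, so the centered value is unchanged.
gc_factor = 0.5
gain_with_gc_bias = centered_log2_ratio(90 * gc_factor, 30 * gc_factor,
                                        test_coverage=40, ref_coverage=20)
```

Here `neutral` evaluates to 0 and `gain` to log2(1.5), with or without the shared bias factor. If the bias differed between the two libraries, it would no longer cancel and this reasoning would not apply.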
Once we had validated SCNAs detected by COPS against a high-density whole genome SNP microarray using a real tumor:normal sample pair, we wanted to test the impact of read coverage on the sensitivity of SCNA detection. We found that the resolution in binned read depths required to call pair-wise CNAs dropped for reads with coverage <5X, particularly when the binned read depths for one or both samples were too low. This was confirmed by our observation of lower concordance between CNAs detected using COPS on low coverage (<5X) tumor:normal complete genome sequencing data and subtractive CNVs detected for the same samples using whole genome SNP microarrays. By increasing the threshold further in the CNA regions detected in the microarray data (by filtering out the low coverage bins with read depths of <7.5X), the concordance of CNAs between sequencing and array data increased to 95.9%, validating the dynamic range drop-off at ~7.5X seen with the simulated data. SCNAs detected across individual chromosomes also indicate a dynamic range drop-off at ~7.5X for a majority of the chromosomes (Table S2).
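The effect of the depth filter on concordance can be sketched as below. The definition used here (the fraction of array-called bins, retained after filtering, for which sequencing gives the same call) is an assumption made for illustration, as are the function name, the per-bin data layout and the toy numbers; the exact concordance measure used in the study is not specified in this section.

```python
def concordance(seq_calls, array_calls, test_depth, ref_depth, min_depth=7.5):
    """Fraction of array-called bins, with depth >= min_depth in both samples,
    for which the sequencing-based call agrees with the array call.
    Returns None if no bin passes the filter."""
    kept = [i for i in array_calls
            if test_depth[i] >= min_depth and ref_depth[i] >= min_depth]
    if not kept:
        return None
    agree = sum(1 for i in kept if seq_calls.get(i) == array_calls[i])
    return agree / len(kept)

# Toy data: bin index -> call. Bin 3 is called by the array only and has low depth.
array_calls = {0: "gain", 1: "gain", 2: "loss", 3: "loss"}
seq_calls = {0: "gain", 1: "gain", 2: "loss"}
test_depth = [12.0, 9.0, 8.0, 3.0]
ref_depth = [11.0, 10.0, 9.0, 4.0]

unfiltered = concordance(seq_calls, array_calls, test_depth, ref_depth, min_depth=0)
filtered = concordance(seq_calls, array_calls, test_depth, ref_depth, min_depth=7.5)
```

In this toy case the only discordant bin is the one with too few reads in both samples, so excluding it raises the concordance from 0.75 to 1.0.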
Boundary mapping is an important step in any CNA detection. There are reports that use soft-clipped reads to detect breakpoints. However, soft-clipped read mapping gave rise to a higher percentage of false positive breakpoints in our sample. Instead, using anomalous read mapping and the difference in density between anomalous reads proved to be a better approach for detecting precise boundaries. This is demonstrated by the higher percentage of breakpoint concordance between SNP microarrays and sequencing reads. The exact boundaries of CNVs, and hence the exact breakpoints, depend on the upstream aligner used to map short sequencing reads to the reference genome. Boundary correction based on differential densities of anomalous reads is aligner-dependent, as different aligners use different parameters to map anomalous reads. COPS (when used without the boundary segmentation module) reported only those SCNAs that fall within a 10% margin of variability in the CNA breakpoints, as found in simulated data. The tagging of simulated SCNA boundaries with anomalously paired reads was best demonstrated with the aligner Novoalign. This is not surprising given that Novoalign is one of the most sensitive aligners known. The breakpoint estimation algorithm used by Illumina’s plugin cnvPartition uses a systematic sliding window approach over 4, 8, 16 and 32 probes to detect consistent departure of preliminarily inferred copy number states from the neutral copy number state of 2, and thus identifies maximally different segments. Breakpoints are then called at the boundaries of these maximally different segments and visualized by the Illumina software GenomeStudio. We found that most of the breakpoints found in arrays but not in the sequencing-based approach were due to a lack of reads in the sequencing data, and those found with sequencing reads but not with the array-based approach were due to a lack of probes for those regions on the array. Unlike COPS, the boundary segmentation module relies on anomalous read mapping in individual samples and hence does not require equal read coverage of the test and ref samples.
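One possible reading of the 10% breakpoint margin mentioned above is that each called boundary must lie within 10% of the true event length from the corresponding simulated boundary. The sketch below encodes that interpretation only; the function name, the choice of the true event length as the reference and the coordinates are assumptions, and the criterion actually used with the simulated data may differ.

```python
def within_margin(called, truth, margin=0.10):
    """True if both called breakpoints lie within margin * (true event length)
    of the corresponding true breakpoints. Intervals are (start, end) in bp."""
    called_start, called_end = called
    true_start, true_end = truth
    tolerance = margin * (true_end - true_start)
    return (abs(called_start - true_start) <= tolerance
            and abs(called_end - true_end) <= tolerance)

# Simulated 50 kb event: the tolerance is 5 kb at each end.
truth = (100_000, 150_000)
close_call = within_margin((101_000, 148_500), truth)   # both ends off by < 5 kb
distant_call = within_margin((110_000, 150_000), truth)  # start off by 10 kb
```

Under this reading the tolerance scales with event size, so a fixed offset in base pairs is more likely to disqualify a small event than a large one.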
We have developed a pair-wise, easy-to-use, biologist-friendly somatic copy number alteration (SCNA) detection tool, COPS, for short-read NGS data, specifically designed to identify somatic CNAs in cancer/disease samples over a wide range of read lengths. We also report an independent boundary segmentation module, the results from which can be fed into COPS to fine-tune the SCNA boundaries. Although COPS is designed to detect CNVs between paired samples from the same individual rather than between two different individuals, its ability to find subtractive copy number alterations allows it to be applied to different individuals sequenced under the same conditions. Ratio-based approaches using paired samples have been used in the past for CNA detection with sequencing and microarray data. We found that the challenge in discovering true CNAs using sequencing data (short-insert single/paired-end or long-insert mate pair) lies primarily in the choice of read depth rather than read count, along with the processing of data before and after the calculation of ratios, such as choosing the correct bin size, filtering background noise, merging bins and filtering false positives. COPS incorporates all these pre- and post-processing steps to allow for a smooth and progressively improving workflow for SCNA detection. By using a database of known CNVs discovered in normal populations and in other disease/cancer samples, one can find CNAs that might play specific role(s) in disease progression. Although the cost of sequencing longer reads is going down, we have shown that detecting most true positive CNAs in a cancer sample requires not longer reads but decent coverage.
We recognize that the ability to detect all the disease-causing CNVs in a sample depends not merely on the sequencing coverage but also on the ability of a particular technology/chemistry to reproducibly sequence the difficult/low-complexity regions of the genome, and hence on the completeness of sequencing.