Copy number variations (CNVs) play an essential role in human disease susceptibility [1, 2] and have been shown to be one potential source of the missing heritability of complex diseases [3]. Together with genome-wide association studies (GWAS), CNV analysis is expected to be a compelling tool for deciphering the pathology of human diseases [4]. SNP arrays have been widely used for CNV studies, and a tremendous amount of data has been generated [5-7]. Although high-throughput sequencing technologies are emerging and have been applied to studies of genetic variation (including CNVs), the cost of a sequencing-based approach is still higher than that of traditional SNP arrays, especially for library construction [8]. In addition, various studies have shown that sequencing data are not sensitive for breakpoint detection [9-11]. Moreover, sequencing technologies have poor mutation detection capability when the sequencing coverage (read depth) is relatively low [12]. Thus, at their current stage of development, we believe that sequencing technologies complement, rather than replace, SNP arrays. Therefore, in this article, we aim to develop a new and more accurate CNV detection pipeline that avoids the common difficulties in SNP array analysis.
High-quality CNV calling requires accurate estimation of raw copy numbers (CNs) and optimized statistical models [6]. Although many methods have been developed for CNV calling from array-based data [7, 13-16], their accuracy is still far from satisfactory, as reflected in their high false discovery rates (FDRs) [5, 17-19]. The high FDRs of these methods mainly arise from (1) cross-hybridization of probes [20], (2) genomic waves of intensities [21-23] and (3) sample dependence of outputs [24-26].
Cross-hybridization between probes and off-target sequences is a longstanding problem in microarray analysis [27-30]. Nevertheless, most previous methods have ignored cross-hybridization and simply taken the mean or median intensities of probes as the estimated raw CNs [15, 31]. Such estimated CNs hardly reflect the true allelic concentrations (ACs) of target sequences, and some studies [6, 7, 20] have shown that cross-hybridization, if not considered, can lead to large bias. To circumvent this problem, one prior investigation used PICR (probe intensity composite representation) to model hybridization and cross-hybridization based on the underlying physicochemical principles of DNA/DNA duplex formation in array experiments, and then removed the effect of cross-hybridization and accurately estimated the AC at a given SNP site through a statistical method [20]. Other similar models have also been reported [28, 32].
In addition to cross-hybridization, Maris et al. have stated that “whole-genome microarrays with large-insert clones designed to determine DNA copy number often show variation in hybridization intensity that is related to the genomic position of the clones.” [22] These ‘genomic waves’ have also been observed in SNP arrays [21-23]. Genomic waves have been shown to be correlated with GC-content [21, 23] and may stem from the amplification of DNA fragments [33]. In array preprocessing, DNA samples are first digested with restriction enzymes, such as Nsp, and then ligated with adapters before amplification. However, owing to differences in the amplification efficiencies of fragments, the PCR procedure can introduce artifacts that may give rise to genomic waves [33]. The presence of these waves hampers the detection of aberrations [23] and introduces hundreds of potentially confounding CNV artifacts that can obscure bona fide variants [33]. To address this difficulty, a computational approach that fits regression models with GC-content included as a predictor variable was proposed in [22], and this approach has improved the accuracy of CNV detection.
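To make the regression idea concrete, the following is a minimal sketch, assuming a single linear GC-content term fitted by ordinary least squares; it illustrates the general technique only and is not the model of [22]. The function names, the toy GC values and the wave amplitude of 0.8 are all hypothetical.

```python
from statistics import mean


def fit_gc_wave(log_ratios, gc_content):
    """Fit log_ratio ~ intercept + slope * GC by ordinary least squares."""
    gc_mean = mean(gc_content)
    lr_mean = mean(log_ratios)
    sxx = sum((g - gc_mean) ** 2 for g in gc_content)
    if sxx == 0:
        raise ValueError("GC content must vary across probes")
    sxy = sum((g - gc_mean) * (r - lr_mean)
              for g, r in zip(gc_content, log_ratios))
    slope = sxy / sxx
    intercept = lr_mean - slope * gc_mean
    return intercept, slope


def remove_gc_wave(log_ratios, gc_content):
    """Return residuals after subtracting the fitted GC trend."""
    intercept, slope = fit_gc_wave(log_ratios, gc_content)
    return [r - (intercept + slope * g)
            for r, g in zip(log_ratios, gc_content)]


# Toy data (assumed): a flat true signal plus a GC-dependent wave.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
observed = [0.8 * (g - 0.40) for g in gc]
corrected = remove_gc_wave(observed, gc)
```

Because the fit includes an intercept, the residuals are centered at zero, so any genome-wide offset would have to be handled separately. A real analysis would typically compute GC-content in a window around each probe and might include further covariates.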
Finally, it has long been known that different sample batches can lead to inconsistent results, even if the data are collected by the same lab [24-26]. Owing to this effect, statistical power in meta-analysis of multiple samples may be significantly reduced [34]. Almost all existing algorithms require multiple samples for training because of their numerous parameters, and different training sample batches can lead to different parameter estimates. The inconsistencies may be caused by this sample-dependent parameter estimation. The effect has also been shown to be correlated with differences in batch sizes and the extent of homogeneity of the samples in each batch. Hence, it has been suggested that highly homogeneous samples be placed in the same training batch [26]. Several other methods to adjust for this batch effect have also been proposed [25, 35, 36].
To the best of our knowledge, existing methods address only one or two of the three factors discussed above. In this study, we developed a novel CNV detection pipeline based on hybridization and amplification rate correction (CNVhac) to accurately detect CNVs on Affymetrix SNP arrays. In contrast to previous methods, CNVhac takes into account all three factors by properly modeling cross-hybridization, smoothing genomic waves and alleviating the sample batch dependence of parameter estimation, thus significantly improving the accuracy of CNV detection. Starting from dozens of basic constants concerning binding affinity, which can be well trained from a single array and are quite stable between arrays, CNVhac is able to obtain the binding affinity between all probes and sequences without suffering from sample batch dependence. CNVhac then applies the PICR method [20] to address the effect of cross-hybridization. Finally, since we have found that the relative amplification efficiencies of different fragments are fairly stable from one array to another, a simple adjustment approach is proposed to smooth the genomic waves. Based on the accurate raw CN estimates, a hidden Markov model (HMM) is also proposed to detect breakpoints along the genome. Application of CNVhac to public datasets shows that our method does enhance the power of both raw CN estimation and CNV calling.
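As a generic illustration of how an HMM can turn raw CN estimates into breakpoints, the following is a minimal sketch of Viterbi decoding over three copy-number states with Gaussian emissions. It is not the CNVhac model: the state set, the emission standard deviation, the self-transition probability and the toy data are all assumptions chosen for illustration.

```python
import math

STATES = [1, 2, 3]   # hypothetical copy-number states: loss, normal, gain
SIGMA = 0.3          # assumed standard deviation of raw CN estimates
STAY = 0.98          # assumed probability of staying in the same state


def log_emission(x, state):
    """Log density of a Gaussian centered on the state's copy number."""
    z = (x - state) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log transition probability; switches share the remaining mass."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(raw_cn):
    """Most likely state path for a sequence of raw CN estimates."""
    if not raw_cn:
        return []
    score = {s: math.log(1 / len(STATES)) + log_emission(raw_cn[0], s)
             for s in STATES}
    back = []
    for x in raw_cn[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best = max(STATES,
                       key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best
            new_score[cur] = (score[best] + log_transition(best, cur)
                              + log_emission(x, cur))
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def breakpoints(path):
    """Indices at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Toy raw CN estimates (assumed): a four-probe gain within a normal region.
raw = [2.1, 1.9, 2.0, 2.2, 3.1, 2.9, 3.0, 3.2, 2.0, 1.9, 2.1]
path = viterbi(raw)
```

With these assumed parameters, `breakpoints(path)` reports the probe indices where the decoded copy number changes. A high self-transition probability discourages calling a CNV from a single noisy probe, which is the main reason to segment with an HMM rather than threshold each probe independently.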