The sensitivity and specificity of CNV identification are essential to association studies of CNVs with disease. In our evaluation of two commercial programs and two publicly available programs widely used for CNV identification in GWAS data, we found considerable variation among the programs in the number of CNVs called. The differences declined as the size of CNVs increased, but substantial variation remained in the accuracy of CNV calls as determined by the recovery test and by qPCR validation, even for large CNVs containing more than 20 markers or longer than 100 kb.
For recovery of CNVs detected by Korn et al. in eight HapMap samples, Birdsuite was superior to the other three programs, as it recovered more CNVs in each category. However, the sensitivity of all four programs was poor when the number of probes spanned by a CNV was small. For CNVs containing more than 20 markers, Birdsuite recovered 88.5% of CNVs, which is comparable to Korn et al.'s finding of 93.8% for Birdsuite. Some of the discrepancy may be explained by the fact that Korn et al. analyzed the same samples but with different .CEL files produced by different labs; the files used in the present study were obtained from Affymetrix, while Korn et al. used .CEL files produced in their own lab. Also, their Birdsuite recovery rate was inflated because Korn et al., evaluating Birdsuite shortly after its development, did not require agreement on CNV state (deletion or duplication); when we required such agreement, Birdsuite recovered only 70% of CNVs containing 20 or more markers.
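The effect of requiring state agreement can be illustrated with a minimal sketch. It assumes that a reference CNV counts as recovered when any call overlaps it by at least one base on the same chromosome; that overlap criterion, the `CNV` record, and the function names are hypothetical choices for illustration, not the criteria used by Korn et al. or in this study. The coordinates are invented.

```python
from typing import NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int   # inclusive, bp
    end: int     # inclusive, bp
    state: str   # "del" or "dup"


def overlaps(a: CNV, b: CNV) -> bool:
    """True when two CNVs on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def recovery_rate(reference, calls, require_state=False):
    """Fraction of reference CNVs matched by at least one call."""
    if not reference:
        raise ValueError("reference set is empty")
    recovered = 0
    for ref in reference:
        for call in calls:
            if overlaps(ref, call) and (not require_state or ref.state == call.state):
                recovered += 1
                break  # count each reference CNV at most once
    return recovered / len(reference)


# Invented example: two reference CNVs, two program calls.
reference = [
    CNV("chr1", 100_000, 250_000, "del"),
    CNV("chr2", 500_000, 620_000, "dup"),
]
calls = [
    CNV("chr1", 120_000, 240_000, "del"),  # overlaps, same state
    CNV("chr2", 510_000, 600_000, "del"),  # overlaps, opposite state
]

loose = recovery_rate(reference, calls)                       # state ignored
strict = recovery_rate(reference, calls, require_state=True)  # state must agree
```

In this toy example the second reference CNV is matched only when state is ignored, so the strict rate is lower than the loose one, which mirrors the direction of the difference reported above.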
Birdsuite's recovery rate of common CNVs containing more than 20 markers varied across genomic regions. For example, there were eight CNV regions carried by any two of the eight HapMap individuals, and therefore common. Six of those eight regions were recovered in both individuals by Birdsuite, but the other two regions were not recovered. Poor recovery rates of CNVs with high frequency (i.e., carried by more than three of the eight HapMap samples) were also observed (File S1: Figure S1).
As expected, the sensitivity of Partek showed a 5.2-fold increase, to 60.7%, after quantile normalization, compared with Korn et al.'s finding of 11.2% without quantile normalization. A very recent study has shown that normalization can improve performance in the analysis of miRNA array data and that quantile normalization is the most robust normalization method. Like all microarrays, Affymetrix SNP arrays are affected by systematic sources of experimental variation. Normalization can help reduce or remove noise that distorts the distribution of observed array data, and thus improve the accuracy of genotyping calls and copy number calls.
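For readers unfamiliar with the technique, the following is a minimal sketch of generic quantile normalization, which forces every array to share the same intensity distribution. It is the textbook procedure and is not claimed to match the implementation in Partek or any other program evaluated here; the function name and the toy intensities are hypothetical, and ties are broken by order of appearance rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length arrays (one list of intensities per array)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must have the same number of probes")
    # Mean intensity at each rank across arrays: the shared target distribution.
    sorted_cols = [sorted(col) for col in columns]
    target = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity in this array.
        order = sorted(range(n), key=col.__getitem__)
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = target[rank]  # replace each value by the rank mean
        normalized.append(out)
    return normalized


# Toy intensities for three arrays and four probes (invented values).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 7.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After normalization each array contains exactly the same set of values, and only the assignment of those values to probes differs, so between-array differences in overall intensity distribution are removed while within-array rank order is kept.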
For our recovery test in 90 CEU HapMap samples, the highest recovery rate was only 47.91%, achieved by Birdsuite for CNVs spanning more than 20 markers. Consistent with the recovery rate shown above, the average recovery rate decreased with increased CNV frequency. On closer inspection, it is clear that Birdsuite's recovery rate of most CNVs containing more than 20 markers was either ≤10% or >90% (File S1: Table S2). For example, there were 32 CNV regions that showed a very high frequency in CEU HapMap samples (>80%). Six of the 32 CNV regions were fully recovered, but the recovery rates of 21 CNV regions were less than 10%, which is consistent with our qPCR results on tested common CNVs. Regardless of the recovery rate, the number of CNVs identified by different programs varied greatly. The most CNVs were identified by HelixTree, almost 4 times the number detected by Birdsuite and 7 times the number detected by PennCNV-Affy and Partek.
One possible reason for the discrepancy in recovery rate between the two datasets (Kidd et al. and Conrad et al.) is the detection method: very dense microarrays used by Conrad et al. vs. sequencing used by Kidd et al. Conrad et al.'s experimental design aimed to discover CNVs larger than 500 bp. However, the median insert clone size was around 40 kb in the paired-end sequencing by Kidd et al., which would make detection of small CNVs more challenging. Consequently, more small CNVs (≤5 kb) were validated by Conrad et al., and more medium-sized CNVs (10–50 kb) were reported by Kidd et al. This may also partially explain the poor consistency of CNVs between the two studies.
There was significant variation in singleton deletion calls among programs: HelixTree detected about two-fold more singleton deletions in the BiGS data set (644 singleton deletions) than Birdsuite, Partek, or PennCNV-Affy. However, more than half of those detections were specific to HelixTree (59.3%), and the majority of HelixTree program-specific calls of deletions were not validated by qPCR. This meant that for program-specific singleton deletions, the positive predictive value of HelixTree was zero based on the qPCR validation of sampled regions. For Partek and Birdsuite, on the other hand, all selected singleton deletions were validated by qPCR. In the recovery test and in qPCR results on singletons, Birdsuite had the best performance among the tested programs.
We checked the program-specific calls closely because some authors have proposed combining CNV calls from multiple independent programs to improve accuracy. The program-specific calls in our data tended not to be validated and are therefore most likely erroneous. Positive predictive value could not be calculated in an unbiased manner in the present study since we selected only program-specific CNVs for qPCR validation. A sampling from all CNVs (both program-specific and shared calls) would provide a more accurate measure of predictive values.
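One way such a sampling could be turned into an overall estimate is sketched below. It assumes calls are divided into strata (for example program-specific and shared), that a random subset of each stratum is tested by qPCR, and that each stratum's positive predictive value is weighted by its share of all calls. The function name and every count are hypothetical and are not data from this study.

```python
def stratified_ppv(strata):
    """Overall PPV from per-stratum qPCR results.

    strata maps a stratum name to (total_calls, tested, validated).
    Each stratum's PPV is weighted by its share of all calls.
    """
    total_calls = sum(total for total, _, _ in strata.values())
    if total_calls == 0:
        raise ValueError("no calls")
    estimate = 0.0
    for name, (total, tested, validated) in strata.items():
        if tested == 0:
            raise ValueError(f"stratum {name!r} has no qPCR-tested calls")
        estimate += (total / total_calls) * (validated / tested)
    return estimate


# Hypothetical counts, for illustration only (not data from this study).
example = {
    "program_specific": (400, 10, 2),  # 2 of 10 tested calls validated
    "shared": (600, 10, 9),            # 9 of 10 tested calls validated
}
overall = stratified_ppv(example)  # about 0.62
```

Testing only the program-specific stratum, as in the hypothetical first row, would suggest a much lower value than the weighted estimate, which is the bias described above.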
More singleton duplications than singleton deletions were called by each program in the BiGS dataset, and each program called singleton duplications that were not called by any of the other three programs. The average positive predictive value for program-specific singleton deletions across all four programs (65%) was higher than for singleton duplications (45%) in our tested regions. We speculate that programs detect deletions better than duplications because a deletion represents a 2-fold change in copy number while a duplication produces only a 1.5-fold change. As might be expected, the 21 highly frequent (>80%) but poorly recovered (≤10%) CNVs were all duplications (File S1: Table S2, Birdsuite's recovery data).
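The fold-change argument can be made concrete on the log2 scale on which array intensities are usually compared. The short calculation below assumes a diploid reference (two copies) and a change of a single copy in either direction; both are illustrative assumptions and do not describe every CNV in this study.

```latex
\[
\log_2\frac{1}{2} = -1 \quad\text{(one-copy deletion)}, \qquad
\log_2\frac{3}{2} \approx 0.585 \quad\text{(one-copy duplication)}
\]
\[
\frac{\lvert \log_2(1/2) \rvert}{\lvert \log_2(3/2) \rvert} \approx 1.71
\]
```

Under these assumptions the expected shift for a deletion is roughly 1.7 times that for a duplication, so at the same noise level a duplication is harder to separate from the two-copy baseline.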
For common CNV regions identified by Canary in the present study, we observed striking variation in the frequency of calls by different programs. Each program had substantial false positive and false negative rates in at least one of the three CNV regions tested. The high false positive and negative rates observed here may arise because common CNVs affect the mean and variance of hybridization intensity over the region included and thus affect the observed log2 ratio of the CNV. This effect may also contribute to the poor recovery rate of highly frequent CNVs.
Plate effects may play a significant role in the accuracy of common CNVs called by Canary. CNP1293, which showed no plate effect, had a high sensitivity and specificity, but CNP2157 and CNP2057 had plate effects and showed a very low sensitivity and specificity. According to our limited qPCR results, the algorithms evaluated here for calling common CNVs all need improvement. We conclude that without independent experimental genotyping, software-called common CNVs based on GWAS array data are not suitable for association studies. In contrast, rare CNVs called by Birdsuite and Partek are of substantially better quality. Similarly, Marenne et al. concluded that further validation was required to assess CNVs as risk factors in complex diseases when they evaluated CNVpartition, PennCNV and QuantiSNP using Illumina Infinium Human 1 Million SNP array data.
Recently, Mei et al. developed two new methods to identify common CNV regions. They evaluated their methods with sequencing-based results from Kidd et al. However, the lowest discordance rate was 55% after excluding individual regions with a confidence score (as developed by them) below the 80th percentile. Two previously published methods, STAC and GISTIC, had similar performance at identifying CNVs with high frequency and moderate confidence. These reports further confirm our observation that common CNV detection methods still have much room for improvement.
Winchester et al. recommended using a second program to generate the most informative results. This recommendation seems to be based on the assumption that the second program performs similarly to the first one, and that their overlap increases reliability. This might not be a safe assumption when not all software suites perform equally well. In our evaluation, Birdsuite was a better choice than the other programs for identifying rare CNVs.
One limitation of the present study is the small number of CNVs tested by qPCR, particularly for the common CNVs. Although the number of qPCR tests performed for validation was limited, the overall trend of frequent non-validation is in agreement with other results from larger datasets (the recovery tests on HapMap samples). These two independent lines of evidence support our concerns regarding the validity of CNV calls based on GWAS data.
The intention of this study is to identify potential pitfalls of current practice in GWAS-based CNV analysis rather than to provide a solution. It is possible that tuning program parameters would improve accuracy, but it appeared reasonable to start with the default parameters recommended by each program's provider.
We evaluated the reproducibility of the two “gold standards” used in this study, the paired-end sequence data of Kidd et al. and the very high density array Comparative Genomic Hybridization (aCGH) data, in the same 8 HapMap individuals. We found relatively poor consistency between the two “gold standards.” The lack of a reliable standard limits how well the recovery of GWAS-based CNV calls can be assessed, particularly for common CNVs since they would be over-represented when 8 individuals are studied. Next-generation sequencing of whole genomes of population samples might be able to provide an ultimate gold standard for identification of common CNVs.
A more extensive list of independently validated CNV regions, and the raw hybridization or other data files used to detect them, should be made publicly available. A greatly expanded version of Kidd et al.'s or Conrad et al.'s HapMap data set used in this and previous studies, with all CNVs confirmed by high coverage sequencing, and with the addition of parental data, might provide an acceptable resource. The public data from dbGaP and similar sources can also be used for this purpose, as can the CNV validation data we have produced in this study and plan to produce in future studies. Once these datasets are available, independent validation studies must be performed, even though they require the expenditure of valuable time and funds.