We present a novel computational framework, IgC2N, to identify and genotype copy number variants. We have applied IgC2N to genome-wide Affymetrix data for HapMap phase 3 samples. However, the approach is not conceptually restricted to array-based data and can also be applied to preprocessed sequencing data.
IgC2N is a three-step procedure. The first step, Detection of Candidate CNV Loci, generates a list of putative CNVs. Unlike most common CNV detection methods, it does not detect CNVs sample by sample but combines information across samples, which increases detection power. The accuracy of CNV breakpoints depends on the marker density of the platform used; for example, a high-resolution platform or deep DNA sequencing data will provide good accuracy. Although we did not investigate it explicitly, fine-tuning the significance threshold for candidate CNV loci detection can be used to query complex CNVs by comparing the boundaries of overlapping variants. In addition, the approach imposes no restriction on the number of covering markers required to call a CNV, which allows the detection of smaller CNVs with poor marker coverage. Recent studies have shown that smaller CNVs are currently being discovered. The second step, CN Class Detection, classifies individuals into copy number classes for the candidate CNVs generated in the first step, using an EM approach to a Gaussian mixture model problem. We also conduct a recursive outlier detection step to detect rare CN classes or rare CNVs that the classification method fails to identify. In the third step, we make a novel attempt to estimate the reference model bias by using the relative 1-CN class difference (described in detail in Materials and Methods) between loci and some genetic information.
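As a rough illustration of the CN class detection step, the sketch below fits a one-dimensional Gaussian mixture by EM to per-sample intensity summaries at a single candidate locus and assigns each sample to its most probable component. It is not the IgC2N implementation: the function name `fit_gmm_1d`, the use of one starting mean per assumed copy number class, the initial variance, the variance floor and the toy intensities are all assumptions made for this example.

```python
import math


def fit_gmm_1d(values, init_means, init_var=0.01, n_iter=50, min_var=1e-4):
    """Fit a 1-D Gaussian mixture by EM (hypothetical helper, not IgC2N code).

    values     : per-sample intensity summaries at one locus
    init_means : one starting mean per assumed copy number class
    Returns (weights, means, variances, labels); labels[i] is the index of
    the most probable component for values[i].
    """
    k, n = len(init_means), len(values)
    means = list(init_means)
    variances = [init_var] * k
    weights = [1.0 / k] * k

    def responsibilities():
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in values:
            joint = [
                weights[j]
                * math.exp(-((x - means[j]) ** 2) / (2.0 * variances[j]))
                / math.sqrt(2.0 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(joint)
            if total == 0.0:
                # All densities underflowed: fall back to the nearest mean.
                nearest = min(range(k), key=lambda j: abs(x - means[j]))
                resp.append([1.0 if j == nearest else 0.0 for j in range(k)])
            else:
                resp.append([p / total for p in joint])
        return resp

    for _ in range(n_iter):
        resp = responsibilities()
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # floor avoids collapse to zero

    labels = [max(range(k), key=lambda j: r[j]) for r in responsibilities()]
    return weights, means, variances, labels


# Toy data: intensities clustered around three assumed CN classes.
intensities = [0.48, 0.52, 0.50, 0.47, 0.98, 1.02, 1.00, 0.99, 1.01, 1.03, 1.49, 1.52]
weights, means, variances, labels = fit_gmm_1d(intensities, init_means=[0.5, 1.0, 1.5])
print(labels)
```

In this toy setting a small class (here two samples) is still recovered because its starting mean is supplied; a rare class with no starting mean would be absorbed by a neighbouring component, which is the kind of case the recursive outlier detection step is described as addressing.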
We evaluated the performance of IgC2N in a simulation study and found that it has at least 80% power to detect rare (1%) CNVs with sufficient marker coverage or sufficient variant size in datasets with larger sample size (N
2000) while detecting common CNVs with similar size or marker coverage with datasets of smaller sample size (N
Applying IgC2N to the HapMap 3 dataset identified 734 novel polymorphic loci that were not reported in DGV (as of March 2010). We characterized this set of novel CNVs by MAF, size and type of polymorphism (deletion, gain or both). We found that the majority of novel deletions are rare (<5% MAF) while the majority of gains are common (10–30% MAF), possibly reflecting the fact that deletions are easier to detect than gains. In terms of size, the majority of the novel CNVs detected by IgC2N are sCNVs, similar to the findings of Kato et al. An overlap analysis of the novel CNVs with genes and exons showed that the size of a CNV does not increase the likelihood of gene overlap, whereas larger CNVs tend to overlap with more exons. We then investigated the mechanism of formation of these structural variants and found that sCNVs are relatively more likely than CNVs to be formed by VNTR, while CNVs are more likely to be formed by NAHR. Overall, NHR constitutes the major part of all CNV formation mechanisms, which is consistent with previous findings.
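The kind of overlap analysis described above can be sketched as a count of annotated features (genes or exons) overlapped by each CNV. The interval convention (closed, 1-based coordinates), the function names and the toy coordinates below are assumptions for illustration only and do not reproduce the annotation sets used in the study.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end


def count_overlapping_features(cnvs, features):
    """For each CNV (chrom, start, end), count overlapped features on the same chromosome."""
    counts = []
    for chrom, start, end in cnvs:
        counts.append(
            sum(
                1
                for f_chrom, f_start, f_end in features
                if f_chrom == chrom and overlaps(start, end, f_start, f_end)
            )
        )
    return counts


# Hypothetical CNVs and exons; coordinates are invented for the example.
cnvs = [("chr1", 100, 200), ("chr1", 1000, 5000), ("chr2", 50, 60)]
exons = [("chr1", 150, 180), ("chr1", 1200, 1300), ("chr1", 4000, 4100), ("chr2", 500, 600)]
exon_counts = count_overlapping_features(cnvs, exons)
print(exon_counts)  # [1, 2, 0]
```

The quadratic scan is adequate for a few thousand loci; for genome-wide annotation sets a sorted sweep or an interval tree would be the usual replacement.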
To validate the novel CNVs detected by IgC2N, we queried high-resolution NimbleGen data with 75 bp resolution. On a per-sample basis, 66.26% of CNVs (variants with size >1 Kb) were confirmed. We found a lower validation rate for CNVs of size ≤500 bp. Although one would expect a higher false positive rate for very small variants, it is also plausible that the longer probes of the NimbleGen platform (>60-mers versus 25-mers for the discovery platform) are at a disadvantage when positioned across breakpoints. In addition, the smaller variants are enriched for formation by VNTR or TEI, and the ensuing sequence complexity can explain the low validation rate. Ad hoc qPCR experiments would confirm or refute the existence of the variants that failed NimbleGen validation and would also determine the accuracy of the genotyping step.
To evaluate the detection ability of IgC2N in comparison with existing methods, we performed an overlap analysis between the set of variants detected by IgC2N and the list of variants reported by McCarroll et al, who used Birdsuite. We detected 86.87% of the CNVs reported by McCarroll et al (applying constraints in terms of power as per the IgC2N simulation), while McCarroll et al failed to detect 70.91% of the CNVs detected by IgC2N, 58.04% of which are reported in DGV. We compared individual-level CN genotypes for overlapping CNVs and found that the majority of CNVs show a high level (90–100%) of genotype concordance across samples. As a surrogate measure of the accuracy of the genotyping algorithm, we evaluated Mendelian consistency in HapMap trios. The discordance rates of IgC2N were lower than those of Birdseye and dChip and comparable to those of ÇOKGEN, as previously reported. These results demonstrate the detection ability and genotyping accuracy of IgC2N.
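A Mendelian consistency check on trio copy number genotypes can be sketched as follows. The rule used here is an assumption, since the text does not spell out the exact criterion: copy numbers are treated as unphased integer totals, each parent may transmit any number of copies from zero up to its own total on a single haplotype, and the child is consistent if its copy number is the sum of one transmissible count from each parent. This ignores de novo events and sex chromosomes.

```python
def is_mendelian_consistent(father_cn, mother_cn, child_cn):
    """True if child_cn can be written as i + j with 0 <= i <= father_cn, 0 <= j <= mother_cn."""
    return any(
        i + j == child_cn
        for i in range(father_cn + 1)
        for j in range(mother_cn + 1)
    )


def discordance_rate(trios):
    """Fraction of (father, mother, child) copy number triples that are inconsistent."""
    if not trios:
        raise ValueError("at least one trio is required")
    discordant = sum(1 for f, m, c in trios if not is_mendelian_consistent(f, m, c))
    return discordant / len(trios)


# Hypothetical genotypes at one locus: (father CN, mother CN, child CN).
trios = [(2, 2, 2), (2, 1, 3), (0, 0, 1), (1, 2, 4), (2, 2, 0)]
rate = discordance_rate(trios)
print(rate)  # 0.4
```

Because the rule is permissive (two parents with two copies each are compatible with any child copy number from 0 to 4), it flags only clear violations and gives a lower bound on genotyping error.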
Finally, when assessing the functional impact of CNVs on the human transcriptome, we found that overall 4.4% of gene transcript levels are significantly associated with CNVs at a false discovery rate of 10%, with 23% of the associations not previously reported. In agreement with previous studies of the association between transcript levels and copy number changes in humans and in mice, we observed a greater functional impact from variants residing outside the protein-coding gene locus. Interestingly, small variants were significantly more prone to affect transcript levels, suggesting a preferential localization on (long-distance) gene enhancers and repressors. In addition, variants involving gains were more likely to have an effect than deletions. Based on the patterns of transcript levels versus observed copy number classes, it is apparent that different regulatory elements, either enhancers or repressors, are partially controlled by genetic variants. Examples of a deletion/enhancer effect include pleckstrin homology domain containing, family F (with FYVE domain) member 1 (PLEKHF1) and Parkinson disease (autosomal recessive, early onset) 7 (PARK7); an example of a deletion/repressor effect is farnesyl-diphosphate farnesyltransferase 1 (FDFT1). Associations suggesting gain/repressor effects include BLMH, ASF1A and mitochondrial ribosomal protein L17 (MRPL17). Gain/enhancer effects include NUTF2 and transcription factor Dp-1 (TFDP1) (File S1). Interestingly, strong associations were detected between outlier transcript levels and rare gain variants (File S1), as for chaperonin containing TCP1, subunit 6A (zeta 1) (CCT6A), complement factor D (adipsin) (CFD), and the gene encoding insulin-like growth factor-binding protein 7 (IGFBP7), recently shown to alter the sensitivity to anticancer therapy.
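To make the 10% false discovery rate threshold concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of association p-values. The text does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five transcript-CNV association tests.
p_values = [0.001, 0.20, 0.03, 0.04, 0.50]
significant = benjamini_hochberg(p_values, fdr=0.10)
print(significant)  # [True, False, True, True, False]
```

The fraction of transcripts with at least one rejected test would then correspond to a summary such as the 4.4% reported above.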
The impact of genetic variants on gene expression represents one mechanism for the phenotypic variation observed in humans and other species. To date, the number of genetic regulatory effects is unknown, as the extent of genetic structural variation has only begun to be elucidated. SNPs and CNVs represent non-redundant sources of genetic variation, as manifested by the fact that there is only partial overlap between gene expression-CNV and gene expression-SNP correlations. This is not surprising, as CNVs and SNPs are not in complete linkage disequilibrium. Kasowski et al demonstrated that a significant fraction (26%–35%) of inter-individual differences in transcription factor binding regions coincides with loci of genetic variation, suggesting a crucial role of cis-elements in the genetics of transcription factors. Altogether, there is increasing interest in identifying genetic variants that show regulatory effects and contribute to explaining phenotypic variation in humans. One might argue that CNV discovery will plateau with the completion of the 1000 Genomes Project (http://www.1000genomes.org/), which will provide a comprehensive list of CNVs with accurate breakpoints. However, array-based data from large collections of individuals will continue to be necessary for studying the relationship between CNVs and human diseases owing to their cost-effectiveness, and methodological improvements in CNV discovery and detection can accelerate the success of large-scale disease susceptibility studies.