Some studies have observed differential taggability for common CNVs; these loci, it is argued, are less likely to be in strong LD (r² > 0.8) with flanking markers than are frequency-matched SNPs. Several explanations have been proposed to account for this observation, such as lower SNP density in regions near CNVs and the higher mutation rates of certain CNVs relative to SNPs (producing greater allelic diversity nearby) [2], [14], [15]. Such a difference between CNV-SNP LD and SNP-SNP LD affects, through its effect on allelic association, our ability to assess the phenotypic influence of CNVs. On the other side of this controversy, the recent genome-wide study of an extensive catalog of CNVs in 16,000 cases of eight common diseases argued that CNVs are generally well-tagged by SNPs. Indeed, among 2- and 3-class CNVs that passed QC and had MAF > 10%, the study found that nearly 80% were tagged by SNPs at r² > 0.80. Consequently, association results for CNVs can be replicated in an independent sample set by the use of tag SNPs. In this study, we set out to investigate CNVs by analyzing their effect on gene expression and their association with disease susceptibility and other traits. The CNVs that are well-tagged by SNPs, which we call tCNVs, facilitate SNP-based simulation studies to evaluate enrichment. We proceeded to test whether these CNVs were disproportionately more likely to be functional than frequency-matched SNPs, either as trait-associated loci or, under the assumption that few trait-associated polymorphisms are likely to alter the composition of gene products, as eQTLs influencing phenotype by altering gene regulation. We found that CNV-tagging SNPs are enriched for cis eQTLs and, furthermore, that reproducible trait associations show an overrepresentation of tCNVs relative to frequency-matched SNPs. Although the tagged CNVs are particularly easy to investigate in enrichment studies, we found that the proportion of eQTLs (at a p-value threshold of 10⁻⁴) in the non-WTCCC CNVs (39%) is higher than in the well-tagged WTCCC CNVs (30%). Given these strong findings on the functional relevance of CNVs, we created a comprehensive online resource of expression-associated CNVs in the HapMap populations to supplement our earlier studies on SNP eQTLs.
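The tagging criterion used above (r² above a threshold with at least one flanking SNP) can be illustrated with a minimal sketch. It assumes that r² is computed as the squared Pearson correlation between per-individual dosages (copy-number class for the CNV, allele count for each SNP); the study may instead have used haplotype-based r². The function names and the toy genotypes are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two dosage vectors (assumed r2 definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no tagging information
    return sxy * sxy / (sxx * syy)


def best_tag(cnv, snps, threshold=0.8):
    """Return (index of the best-tagging SNP, its r2, whether r2 exceeds threshold)."""
    r2_values = [genotype_r2(cnv, s) for s in snps]
    best = max(range(len(r2_values)), key=r2_values.__getitem__)
    return best, r2_values[best], r2_values[best] > threshold


# Toy data (hypothetical): one value per individual, coded 0/1/2.
cnv = [0, 1, 2, 1, 0, 2, 1, 0]
snps = [
    [0, 1, 2, 1, 0, 2, 1, 0],  # perfect proxy for the CNV
    [2, 1, 0, 1, 2, 0, 1, 2],  # perfect proxy with the opposite allele coding
    [0, 0, 1, 2, 1, 0, 2, 1],  # nearly uncorrelated with the CNV
]
idx, r2, tagged = best_tag(cnv, snps)
print(idx, r2, tagged)
```

Because r² is sign-invariant, a SNP coded on the opposite allele tags the CNV equally well; only the strength of the correlation matters for the threshold.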
CNVs can affect phenotype in several ways [16]. Genes fully covered by CNVs may contribute to disease through a duplication or deletion event. CNV breakpoints may disrupt the expression of genes that overlap CNVs. Beyond such local effects, we have identified two CNVs that control transcript abundance at considerable distance (in trans) from their targets and are therefore potential master regulators. Another CNV contains multiple regulatory elements, each of which predicts the expression of at least 80 transcripts; the deletion of such important regulatory elements is likely to profoundly alter gene transcription. Importantly, we observed that 1,306 CNVs (out of the 3,432 CNVs included in the WTCCC study) harbor at least one SNP eQTL (defined at a p-value threshold of 10⁻⁴). Given our earlier observation that tCNVs are enriched for expression-associated CNVs (eCNVs), it is natural to ask whether those CNVs not well-tagged by SNPs also have notable properties with respect to gene regulation. Of the CNVs that are tagged at r² < 0.30, 44% harbor SNP eQTLs.
Previous studies [7], [15], [16] have reported that genes that undergo dosage differences due to the presence (or proximity) of CNVs are enriched for genes involved in immune response and response to external biotic stimuli. We identified an olfactory receptor gene, OR6J1, and the gene DAD1 (defender against cell death 1), both on chromosome 14, as trans-regulated targets of the tCNV CNVR5165.1 on chromosome 11. We found an overrepresentation of target genes involved in the calibrated molecular response to stimulus (whether a chemical stimulus or a potential internal or invasive threat). Our novel observation in this regard is that the previously reported enrichment for genes relevant to molecular-environmental interactions generalizes to the target genes of tCNVs acting as eCNVs.
The recent WTCCC CNV study concluded that common CNVs that can be typed on existing platforms are unlikely to have a major role in the genetic basis of complex diseases [8]. The same study reported that there was no enrichment of association signals among CNVs involving exonic deletions. Our findings recommend caution in assessments of the contribution of CNVs to the genetics of complex traits. Even under the assumption that most common CNVs are well tagged by SNPs and therefore interrogated by existing SNP GWAS, reproducible trait associations are enriched for these CNVs compared to random expectation. The prominence of such CNVs among reproducible trait associations with autoimmune disorders and metabolic traits suggests that these variants may indeed contribute to certain disease classes; alternatively, they may act in conjunction with other variants such as SNPs to confer susceptibility. Importantly, these CNVs are disproportionately more likely to predict transcript levels than frequency-matched SNPs, and they are more likely to affect many different gene expression traits as master regulatory polymorphisms. An important issue to address is whether the enrichment of tCNVs as eQTLs and their enrichment as disease-associated SNPs are correlated. Note that the probability that a random SNP is found in the NHGRI catalog increases from 0.062% in the set of HapMap SNPs to about 0.5% (more than 5-fold) in the set of well-tagged CNVs that are not eQTLs and to 2% in the tCNVs that are eQTLs.
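The fold changes implied by the catalog probabilities quoted above can be checked with simple arithmetic. The sketch below uses only the three percentages given in the text; the variable names are illustrative, and because "about 0.5%" is itself approximate, the resulting ratios are approximate too.

```python
# Probability (in percent) that a SNP appears in the NHGRI catalog, as quoted in the text.
p_hapmap = 0.062         # all HapMap SNPs
p_tcnv_non_eqtl = 0.5    # well-tagged CNVs that are not eQTLs ("about 0.5%")
p_tcnv_eqtl = 2.0        # tCNVs that are eQTLs

# Fold increase relative to the HapMap background.
fold_non_eqtl = p_tcnv_non_eqtl / p_hapmap
fold_eqtl = p_tcnv_eqtl / p_hapmap

# Fold increase of eQTL tCNVs relative to non-eQTL tCNVs.
fold_eqtl_vs_non = p_tcnv_eqtl / p_tcnv_non_eqtl

print(round(fold_non_eqtl, 1), round(fold_eqtl, 1), fold_eqtl_vs_non)
```

On these figures, the non-eQTL tCNVs show roughly an 8-fold increase over background (consistent with "more than 5-fold"), and the eQTL tCNVs roughly a 32-fold increase, about four times the non-eQTL rate.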
It should be noted that the WTCCC CNVs included in our study reflect certain limitations. As the WTCCC study itself explicitly noted [8], a large proportion (nearly 60%) of the candidate list of putative CNVs could not be reliably assigned copy number classes from the combination of experimental assay and analytical approaches; it is estimated that only about half of these are not polymorphic [8]. In particular, nearly 6,500 such putative polymorphisms were excluded from subsequent analyses because they were called with a single copy number class. Our conclusions may not generalize to these CNVs. Furthermore, eQTL mapping in microarray-based studies in LCLs is likely to yield only a subset of the eQTLs that will be identified using more refined methods in a variety of human tissues. The present study also has little conclusive to say about low-frequency variants. Despite these limitations, the annotation system we implemented in SCAN should prove useful to other investigators; it seeks to be as comprehensive as possible by providing functional information for the most extensive map of CNVs to date, from the recent population-based genome-wide survey [7]. Since the MAF spectrum of the NHGRI catalog of trait-associated SNPs from published GWAS is quite different from that of the SNPs on the genotyping platforms used to conduct these GWAS, we performed our enrichment analyses while conditioning on the MAF distribution. There are other potential representational biases in the NHGRI catalog of reported variants that may have affected our studies. The enrichment in disease classes relative to other classes of traits, for example, may be the result of increased power (e.g., due to greater sample sizes) for these categories.
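Conditioning an enrichment analysis on the MAF distribution can be sketched as a frequency-matched resampling test. This is not a description of the exact procedure used in the study; it is a minimal illustration that assumes MAF bins of width 0.05, random draws of background SNPs that match the test set bin by bin, and an empirical one-sided p-value. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def maf_bin(maf, n_bins=10):
    """Map a MAF in (0, 0.5] to one of n_bins equal-width bins (assumed binning)."""
    return min(int(maf * 2 * n_bins), n_bins - 1)


def matched_enrichment(test_ids, maf, annotated, n_sims=1000, seed=0):
    """Compare annotated SNPs in test_ids against MAF-matched random SNP sets.

    test_ids  : SNP ids of interest (e.g. CNV-tagging SNPs)
    maf       : dict mapping every background SNP id to its minor allele frequency
    annotated : set of SNP ids carrying the annotation (e.g. eQTL or catalog SNP)
    Returns (observed count, empirical one-sided p-value).
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for snp, freq in maf.items():
        bins[maf_bin(freq)].append(snp)

    # How many SNPs must be drawn from each MAF bin to mirror the test set.
    needed = defaultdict(int)
    for snp in test_ids:
        needed[maf_bin(maf[snp])] += 1

    observed = sum(snp in annotated for snp in test_ids)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        count = 0
        for b, k in needed.items():
            count += sum(snp in annotated for snp in rng.sample(bins[b], k))
        if count >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (at_least_as_extreme + 1) / (n_sims + 1)


# Toy data (hypothetical): 2,000 background SNPs, 100 test SNPs, 90 annotated SNPs,
# 30 of which fall in the test set.
toy_rng = random.Random(1)
maf = {f"rs{i}": toy_rng.uniform(0.01, 0.5) for i in range(2000)}
ids = list(maf)
test_ids = ids[:100]
annotated = set(ids[:30]) | set(toy_rng.sample(ids[100:], 60))
observed, p_value = matched_enrichment(test_ids, maf, annotated, n_sims=500)
print(observed, p_value)
```

Matching within bins means that any excess of annotated SNPs in the test set cannot be attributed to the test set simply containing more common variants than the background.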
Since gene expression is intermediate to other complex phenotypes, a global view of the influence of CNVs on the transcriptome may lead to a better understanding of their role in disease susceptibility. While we identified, as perhaps expected, CNVs located in the HLA region associated with multiple expression phenotypes and with various autoimmune disorders, we also observed, strikingly, several non-HLA CNVs that regulate multiple transcripts, including two (one on chromosome 10 and the other on chromosome 2) associated with more than 100 expression traits. Using the most extensive population-based CNV map available [7], we observed a greater proportion of master regulatory CNVs than among the well-tagged CNVs. Collectively, these findings reinforce the importance of considering all types of variation to elucidate the genetic architecture of complex traits.