We performed association testing at each of the CNVs that passed QC, using two parallel approaches. First, we applied a frequentist likelihood ratio association test that combines calling (using CNVtools) and testing into a single procedure, extending an approach described previously 18. Second, we undertook Bayesian association analyses in which the posterior probabilities from CNVCALL were used to calculate a Bayes Factor measuring the strength of association with the disease phenotypes. An important feature of both sets of analyses is that they correctly handle uncertainty in the assignment of individuals to copy-number classes and that, by allowing for some systematic differences in intensities between cases and controls, they provide robustness against certain artefacts that could arise from differences in data properties between cases and controls. There were no substantial differences between the broad conclusions from the frequentist and Bayesian approaches.
Our association analyses were based on a model in which a single parameter quantifies the increase in disease risk between successive copy number classes, analogous to that underlying the trend test for SNP data. Various analyses of the robustness of our procedure, adequacy of the model, and lack of population structure were encouraging (see SoM and Online Methods). For example, Supplementary Figure 23 shows quantile-quantile (QQ)-plots for the primary comparison of each case collection against the combined controls, and for the analogous comparisons between the two control groups. These show generally good agreement with the expectation under the null hypothesis.
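To make the single-parameter trend model concrete, the following minimal sketch fits logit P(case) = a + b × copy number and forms a likelihood ratio test of b = 0. It is an illustration under stated assumptions, not the CNVtools procedure: uncertainty in calls enters only through each sample's posterior-mean copy number (a "dosage" simplification), intensities are not modelled, copy number is taken to equal the class index, and all function names and the toy data are hypothetical.

```python
import math

def expected_copy_number(posteriors):
    """Posterior-mean copy number; posteriors[k] is P(copy-number class k)."""
    return sum(k * p for k, p in enumerate(posteriors))

def fit_trend(x, y, iters=100):
    """Fit logit P(y=1) = a + b*x by Newton-Raphson; return (a, b, log-likelihood)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. a
            g1 += (yi - p) * xi     # gradient w.r.t. b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return a, b, ll

def trend_lrt(x, y):
    """Likelihood ratio test of b = 0: (odds ratio per copy, statistic, p-value)."""
    _, b, ll_alt = fit_trend(x, y)
    frac = sum(y) / len(y)
    ll_null = sum(math.log(frac) if yi else math.log(1.0 - frac) for yi in y)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 d.f.
    return math.exp(b), stat, p_value

def make_group(counts, status):
    """Hypothetical samples: counts[k] samples favouring class k with probability 0.9."""
    rows = []
    for k, n in enumerate(counts):
        post = [0.05, 0.05, 0.05]
        post[k] = 0.9
        rows += [(expected_copy_number(post), status)] * n
    return rows

# Toy data: controls (status 0) and cases (status 1) with shifted class counts.
data = make_group([40, 40, 20], 0) + make_group([20, 40, 40], 1)
x, y = zip(*data)
odds_ratio, stat, p_value = trend_lrt(x, y)
print(odds_ratio, stat, p_value)
```

The odds ratio returned here is per unit increase in copy number, matching the convention used for the ORs reported in the text.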
Careful analysis of our association testing revealed several sophisticated biological artefacts which can lead to false positive associations. These include dispersed duplications, whereby the variation at a CNV is not in the chromosomal location in the reference sequence to which the probes in the CNV uniquely match, and a DNA source effect whereby particular CNVs, and genome-wide intensity data, can look systematically different according to whether the assayed DNA was derived from blood or cell-lines. See Box “Some Artefacts in CNV Association Testing” for illustrations and further details.
Box. Some Artefacts in CNV Association Testing
Some types of artefacts, such as population structure and calling artefacts, are very similar to those seen in SNP studies. Others, related to differences in data properties between cases and controls, can be potentially more serious for CNVs 26,27. In this box we draw attention to some specific artefacts of biological interest that we observed and which researchers should consider as explanations of putative disease-relevant associations. We note that, for the unwary, some of these artefacts could easily survive “replication” of an association.
shows cluster plots for a particular CNV (CNVR2664.1) which exhibits a strong case-control association signal for breast cancer cases (p = 5 × 10⁻¹⁴³, higher copy number for disease) with a similar signal for rheumatoid arthritis (p = 3 × 10⁻²⁷), and a signal in the opposite direction for coronary artery disease (p = 4 × 10⁻³⁰). The right hand class (green curve) has a higher frequency in BC (and RA), and a lower frequency in CAD. (Area under green curve is the same for each collection.) This turned out to be an artefact caused by differences in sex ratio in the various case and control samples (breast cancer: 100% female; rheumatoid arthritis: 74% female; coronary artery disease: 22% female; controls: 50% female). Comparing breast cancer cases against female controls abolished the signal. The CNV is annotated as being on chromosome 5 and all 10 probes in the CNV map uniquely to chromosome 5 in the human reference sequence. However, we found that SNPs which tagged the variation at this CNV all mapped to the X-chromosome and that the region containing the probes for this CNV is present on the X-chromosome in the Venter genome. We conclude that the CNV is a dispersed duplication, with the variation actually occurring on the X-chromosome, and not on chromosome 5. We found one similar example, of a CNV (CNVR1065.1, featuring in as a replicated association) annotated as mapping uniquely to chromosome 2 which shows a strong signal in type 1 diabetes and rheumatoid arthritis. Careful examination shows it to be another dispersed duplication where the polymorphism is located in the HLA, and is well tagged by HLA SNPs known to be associated with both diseases. Supplementary Figure 27 shows the clear evidence from inter-chromosomal linkage disequilibrium that these two loci are dispersed duplications.
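The sex-ratio artefact can be illustrated with a small calculation. In the sketch below, the female fractions are those reported above, but the within-sex frequencies of the high-copy class are hypothetical values chosen only for illustration. They are identical in cases and controls, so any difference between collections arises purely from sex composition.

```python
# Female fraction of each collection, as reported in the text.
female_fraction = {"BC": 1.00, "RA": 0.74, "CAD": 0.22, "controls": 0.50}

# Hypothetical frequencies of the high-copy class within each sex. They do not
# depend on disease status, so there is no true disease association.
HIGH_CLASS_FREQ = {"female": 0.6, "male": 0.2}

def expected_high_class_freq(frac_female):
    """Expected frequency of the high-copy class in a mixed-sex collection."""
    return (frac_female * HIGH_CLASS_FREQ["female"]
            + (1.0 - frac_female) * HIGH_CLASS_FREQ["male"])

freqs = {name: expected_high_class_freq(f) for name, f in female_fraction.items()}
for name, freq in freqs.items():
    print(f"{name}: {freq:.3f}")

# Restricting the controls to females removes the difference for breast cancer.
female_controls = expected_high_class_freq(1.0)
print("BC minus female controls:", freqs["BC"] - female_controls)
```

With these assumed values the high-copy class is more frequent than in controls for BC and RA and less frequent for CAD, reproducing the direction of the spurious signals, and the BC difference vanishes once only female controls are used.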
Variation in DNA source
shows cluster plots for a different CNV (CNVR866.8) with striking differences in T2D as compared with the UKBS controls (or against just the 58C controls). The plots show histograms of normalised intensity ratios for 6 collections. The pattern across collections is informative. The collections in the top row show a single tight peak towards the right of the plot. Those in the bottom row show a single, more dispersed, peak to the left. The collections in the middle row show evidence of both peaks. It turns out that for collections with the tight peak all DNA samples were derived from blood, whereas all samples in the two collections with the single dispersed peak had DNA derived from cell lines. The remaining collections contain DNA samples derived from both sources. This CNV (and many others) thus exhibits systematically different behaviour depending on the DNA source. shows a plot of the second (PC2) and third (PC3) principal components of the array-wide intensity data (plot created using all samples post QC from all 10 collections using data from all CNVs with each point representing one sample, with the points coloured according to whether that sample was derived from blood (red) or cell-lines (blue)). It is clear that these two components can almost perfectly classify samples according to the source of the DNA.
Lymphoblastoid cell lines are typically grown from transformed B-cells, whereas DNA extracted from blood comes largely from a mixture of white blood cells. One specific feature of B-cells is that each B-cell has been subject to its own pattern of rearrangements around the immunoglobulin genes via the process of V-D-J recombination 28. This suggests a natural candidate for our observed DNA source effect, and indeed the CNV illustrated in is located close to one of the immunoglobulin genes, as are the other instances we have found of similar gross DNA source effects. But it is not the whole story. Principal components analysis of genome-wide intensity data with any probe mapping to within 1Mb of an immunoglobulin gene excluded from analysis (Supplementary Figure 29) shows reasonably clear discrimination by DNA source (though less clear than when all probes are included), with many probes, genome-wide, contributing to the discrimination.
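The idea of separating samples by DNA source using principal components of intensity data can be sketched on a toy matrix. This is a simplified illustration only: the intensities are invented, the function name is hypothetical, and the sketch uses just the leading component obtained by power iteration, whereas the analysis described above used PC2 and PC3 of the real array-wide data.

```python
import math

def leading_pc_scores(matrix, iters=200):
    """Unit-norm sample scores on the leading principal component.

    matrix[i][j] is the intensity of probe j in sample i. Probes are centred,
    and the top eigenvector of the sample Gram matrix is found by power
    iteration; it is proportional to the PC1 scores.
    """
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in matrix]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variation in the intensity data")
        v = [x / norm for x in w]
    return v

# Invented intensities: three blood-derived and three cell-line-derived samples,
# with the cell-line samples shifted at probes 0, 1 and 3.
source = ["blood"] * 3 + ["cell_line"] * 3
intensities = [
    [1.00, 1.02, 0.98, 1.01],
    [1.01, 0.99, 1.00, 0.99],
    [0.99, 1.00, 1.02, 1.00],
    [0.60, 0.71, 1.00, 0.50],
    [0.62, 0.69, 0.99, 0.52],
    [0.59, 0.70, 1.01, 0.49],
]
scores = leading_pc_scores(intensities)
for label, score in zip(source, scores):
    print(label, round(score, 3))
```

In this toy example the sign of the score separates the two DNA sources; the overall sign of a principal component is arbitrary, so only the grouping is meaningful.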
Dispersed duplications and DNA source effects represent somewhat interesting biological artefacts. We also observed more prosaic effects. As one example, Supplementary Figure 30 shows that the row of the plate in which a sample was run has systematic effects on probe intensity.
Independent replication of putative association signals is a routine and essential aspect of SNP-based association studies. Particularly in view of the differences in data quality between SNP assays and CNV assays, and the wide range of possible artefacts in CNV studies, replication is even more important in the CNV context. Several possible approaches to replication are available. When a CNV is well-tagged by a SNP (or SNPs), replication can be undertaken by assessment of the signal at the tag SNP(s) in an independent sample, either by typing the SNP or by reference to published data. Where no SNP tag is available, direct typing of the CNV in independent samples is necessary, either using a qualitative breakpoint assay or a quantitative DNA dosage assay. In most cases there will be a choice of assays. Interestingly, replication via SNPs was possible for 15 out of 18 of the CNVs for which we undertook replication based on analysis of our penultimate data freeze.
plots p-values for the primary frequentist analysis for each CNV in each collection. provides details of the top, replicated, association signals in our experiment after visual inspection of cluster plots to detect artefacts not removed by earlier QC. Cluster plots for each CNV in are shown in Supplementary Figures 18 and 19, and Supplementary Files 2
Replicated CNV associations and those at replicated loci
There is one positive control for the diseases we studied, namely the known CNV association at the IRGM locus in Crohn's disease 7. Reassuringly, our study found this association (p = 1 × 10⁻⁷, odds ratio (OR) = 0.68; throughout, all ORs are with respect to increasing copy number).
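Assuming the log-additive trend model used for the association analyses, an odds ratio quoted per unit increase in copy number compounds multiplicatively across copy-number classes. As a worked illustration (the symbols α, β and c are notation introduced here, not taken from the original text):

```latex
\[
\operatorname{logit} P(\text{disease} \mid c) = \alpha + \beta c,
\qquad
\mathrm{OR} = e^{\beta},
\qquad
\frac{\operatorname{odds}(c+2)}{\operatorname{odds}(c)} = \mathrm{OR}^{2} = 0.68^{2} \approx 0.46 .
\]
```

Under this model, an OR of 0.68 means that each additional copy lowers the odds of disease by roughly a third, and two additional copies lower them by roughly a half.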
We identified three loci – HLA for Crohn's disease, rheumatoid arthritis, and type 1 diabetes; IRGM for Crohn's disease; and TSPAN8 for type 2 diabetes – at which CNVs appeared associated with disease, all of which we convincingly replicated through previously typed SNPs that tag the CNV, and a fourth locus (CNVR7113.6), at which there is suggestive evidence for association and replication in both Crohn's disease and type 1 diabetes.
We observed CNVs in the HLA region associated variously with Crohn's disease (CNVR2841.20, p = 1.2 × 10⁻⁵, OR = 0.80), rheumatoid arthritis (CNVR2845.14, p = 1.4 × 10⁻³⁹, OR = 1.77), and type 1 diabetes (CNVR2845.46, p = 8 × 10⁻¹⁵³, OR = 0.2). Copy number variation has previously been documented on various HLA haplotypes 19 and due to the extensive linkage disequilibrium in the region it is perhaps not unexpected to have found CNV associations in our direct study. Linkage disequilibrium across the HLA region has hampered attempts to fine-map causal variation across this locus, and we have no evidence that suggests that the HLA CNVs associated with autoimmune diseases in this study represent signals independent of the known associated haplotypes.
We identified two distinct CNVs 22kb apart upstream of the IRGM gene, both of which are associated with Crohn's disease. The longer CNV (CNVR2647.1, p = 1.0 × 10⁻⁷, OR = 0.68) has previously been identified 7 as a possible causal variant on an associated haplotype first identified through SNP GWAS 14, and acted as our positive control, but the association of the smaller CNV (CNVR2646.1, p = 1.1 × 10⁻⁷, OR = 0.68, located <2kb downstream from a different gene, MST150) is a novel observation. While direct experimental evidence links the associated haplotypes with variation in expression of the IRGM gene, it does not bear on the question of which of the two CNVs or the associated SNPs might be driving this variation 7. Our conditional regression analyses on the two CNVs and SNPs on this haplotype do not point significantly to any one of these as being more strongly associated.
SNP variation in the TSPAN8 locus was recently shown to be reproducibly associated with type 2 diabetes 20, but the potential role of a CNV is a novel observation. This CNV (CNVR5583.1, p = 3.9 × 10⁻⁵, OR = 0.85) potentially encompasses part or all of an exon of TSPAN8 and so is a plausible causal variant. The most significantly associated SNP identified in the recent meta-analysis is only weakly correlated with the CNV as originally tested (r² = 0.17), and so the CNV may simply be weakly correlated with the true causal variant. Closer examination of probe-level data at this CNV suggests a series of different events (including an inverted duplication and a deletion) resulting in more complex haplotypes than those tested for association by our automated approach. With this more refined definition of haplotypes the signal is somewhat stronger. See SoM for details.
CNVR7113.6 lies within a cluster of segmentally duplicated sequences that demarcate one end of a common 900kb inversion polymorphism on chromosome 17 that has previously been shown to be associated with number of children and higher meiotic recombination in females 21. The CNV shows weak evidence for association with Crohn's disease (p = 1.8 × 10⁻³, OR = 1.15) and type 1 diabetes (p = 1.1 × 10⁻³, OR = 1.13), but is in extremely high LD (r² = 1) with SNPs known to tag the inversion, and so is in tight LD with a long haplotype spanning many possible causal variants. This CNV encompasses at least one spliced transcript, but no high confidence gene annotations. Fine-mapping the causal variant within such a long, tightly-linked, haplotype is likely to prove challenging.
In addition to the loci in , we undertook replication on thirteen other loci, detailed in Supplementary Table 13, for which there was some evidence of association (p < 1 × 10⁻⁴ (Bayes Factor [BF]) > 2.1) in our analysis of the penultimate data freeze. Replication results were negative for all these loci. Several other loci for which there is weak evidence (p < 1 × 10⁻⁴ (BF) > 2.6) for association in our final data analysis are listed in Supplementary Table 14.
To further investigate the potential role of CNVs as pathogenically relevant variants underlying published SNP-associations we took 94 association intervals in T1D, CD, and T2D (excluding the HLA), and for the index SNP in each association interval assessed its correlation with our calls at 3,432 CNVs. We identified two index SNPs as being correlated with an r² of greater than 0.5 with a called CNV. The SNPs were: rs11747270 with both CNVR2647.1 and CNVR2646.1 (IRGM), and rs2301436 and CNVR3164.1 (CCR6), both for Crohn's disease. Both of these association intervals were also identified in an independent analysis using CNV calls on HapMap samples by Conrad et al.
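The r² screen described above amounts to a squared Pearson correlation between SNP genotypes and CNV calls across samples. The sketch below is a minimal illustration assuming hard calls for both (SNP allele counts 0/1/2 and integer copy numbers); the function names and the toy genotypes are hypothetical, and the original analysis may have handled call uncertainty differently.

```python
def r_squared(snp_genotypes, cnv_copy_numbers):
    """Squared Pearson correlation between SNP allele counts and CNV calls."""
    n = len(snp_genotypes)
    if n != len(cnv_copy_numbers) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_s = sum(snp_genotypes) / n
    mean_c = sum(cnv_copy_numbers) / n
    cov = sum((s - mean_s) * (c - mean_c)
              for s, c in zip(snp_genotypes, cnv_copy_numbers))
    var_s = sum((s - mean_s) ** 2 for s in snp_genotypes)
    var_c = sum((c - mean_c) ** 2 for c in cnv_copy_numbers)
    if var_s == 0 or var_c == 0:
        raise ValueError("r^2 is undefined when either variable is constant")
    return cov * cov / (var_s * var_c)

def is_correlated(snp_genotypes, cnv_copy_numbers, threshold=0.5):
    """True if the SNP and CNV exceed the r^2 threshold (0.5 in the text)."""
    return r_squared(snp_genotypes, cnv_copy_numbers) > threshold

# Toy data: each SNP allele removes one copy, so the SNP tags the CNV perfectly.
snp = [0, 1, 2, 0, 1, 2, 1, 0]
cnv = [2, 1, 0, 2, 1, 0, 1, 2]
print(r_squared(snp, cnv), is_correlated(snp, cnv))
```

The same function applies with a stricter threshold, such as the r² > 0.8 used elsewhere in the text to define well-tagged CNVs.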
As a further test of our approach, we examined three multi-allelic CNVs which have attracted attention in the literature, both for the challenges of obtaining reliable data, and for putative associations with a range of autoimmune diseases: CCL3L1 (our CNVR7077.12), Beta-Defensins (CNVR3771.10), and FCGR3A/B. Encouragingly, all three CNVs pass QC and give good data. shows cluster plots for these CNVs in our experiment. The best calls for the three CNVs required the use of two analysis pipelines (sets of choices about normalisation and probe summaries) different from our standard pipeline. None of the CNVs shows significant association with the three autoimmune diseases in our study after allowance for multiple testing. In particular, we do not see formally significant evidence to replicate the reported association for CCL3L1 and rheumatoid arthritis 24 (nominal p = 0.058).
We also assessed whether CNVs which delete all or part of exons might be enriched amongst disease susceptibility loci, even if our study were not well-powered enough to see statistically significant evidence of association for individual CNVs. To do so, we compared the 53 exonic deletion CNVs 12 which passed QC with collections of CNVs of the same size, matched for MAF and numbers of classes. We used a (two-sided) Wilcoxon signed-rank test 25 to ask whether the strength of signal for association (measured by Bayes Factors) was systematically different for the exon-deletion CNVs as compared to the matched CNVs. We found no evidence that deletion of an exon systematically changed evidence for association (see SoM). In a related analysis, we compared CNVs passing QC which were well tagged by SNPs (r² > 0.8) to those passing QC which were not, again matching for MAF and number of classes (excluding low MAF CNVs and those failing Hardy-Weinberg equilibrium tests to avoid calling artefacts). There was no evidence that CNVs passing QC which are not well tagged by SNPs are enriched for stronger signals of association compared to those which were well tagged (see SoM).
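A two-sided Wilcoxon signed-rank comparison of this kind can be sketched as follows. The sketch assumes one summary value per matched set (for example, a single log Bayes Factor standing in for the matched CNVs of each exon-deletion CNV), which is a simplification the text does not specify. It uses the normal approximation with a tie correction, and the function name and toy values are hypothetical.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired values.

    Returns (W+, p-value), with the p-value from the normal approximation.
    Pairs with zero difference are dropped; tied |differences| share the
    average rank and the variance is corrected for ties.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return w_plus, p_value

# Toy log10 Bayes Factors (invented): exon-deletion CNVs and matched summaries.
exonic = [0.1, -0.3, 0.4, -0.2, 0.0, 0.5, -0.6, 0.2, -0.1, 0.3]
matched = [0.0, -0.1, 0.1, 0.2, -0.5, -0.1, 0.1, 0.3, -0.9, 0.4]
w_plus, p_value = wilcoxon_signed_rank(exonic, matched)
print(w_plus, p_value)
```

For small numbers of pairs an exact null distribution would be preferable to the normal approximation; with 53 pairs, as in the comparison described above, the approximation is generally considered adequate.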