The average number of CNVs detected per experiment was 70 and 24 for the WGTP and 500K EA platforms, respectively (Supplementary Tables 8
). Due to the nature of the comparative analysis, each WGTP experiment detects CNVs in both test and reference genomes, whereas each 500K EA experiment detects CNVs in a single genome. The median size of CNVs from the two platforms was 228 kb and 81 kb, respectively, and the mean size was 341 kb and 206 kb. Consequently, the average length of the genome shown to be copy number variable in a single experiment is 24 Mb and 5 Mb on the WGTP and 500K EA platforms, respectively. The larger median size of the WGTP CNVs partially reflects the inevitable overestimation of CNV boundaries on a platform comprising large-insert clones: a CNV encompassing only a fraction of a clone can be detected, but will be reported as if the whole clone were involved.
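The per-experiment totals quoted above follow from multiplying the average number of CNVs per experiment by the mean CNV size. A minimal Python check of that arithmetic, using only the figures given in the text (the function and variable names are illustrative, not from the study):

```python
def variable_length_mb(cnvs_per_experiment, mean_size_kb):
    """Approximate length (in Mb) of genome called copy number variable
    in one experiment: number of CNVs times their mean size."""
    return cnvs_per_experiment * mean_size_kb / 1000.0

# Figures taken from the paragraph above.
wgtp_mb = variable_length_mb(70, 341)       # 23.87 Mb, reported as 24 Mb
snp_array_mb = variable_length_mb(24, 206)  # 4.944 Mb, reported as 5 Mb
```

The product uses the mean rather than the median size, because only the mean scales linearly to a total length.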
By merging overlapping CNVs identified in each individual, we delineated a minimal set of discrete copy number variable regions (CNVRs) among the 270 samples (Supplementary Table 11
). We identified 913 CNVRs on the WGTP and 980 CNVRs on the 500K EA platform and mapped their genomic distribution. Approximately half of these CNVRs were called in more than one individual, and 43% of all CNVs identified on one platform were replicated on the other. Combining the data resulted in a total of 1,447 discrete CNVRs, covering 12% (~360 Mb) of the human genome. Using locus-specific quantitative assays on a subset of regions, we validated 173 (12%) of these CNVRs (Supplementary Tables 4 & 12
). A minority (30%) of these 1,447 CNVRs overlapped those identified in previous studies 1-3,5-8,29
. Combining different classes of experimental replication revealed that 957 (66%) of the 1,447 CNVRs detected here have either been replicated on both WGTP and 500K EA platforms, or with a locus-specific assay, or in another individual, or in a previous study (Supplementary Table 12
). Whole genome views of CNV show that while common, large-scale CNV is distributed in a heterogeneous manner throughout the genome (Supplementary Figure 6
), no large stretches of the genome are exempt from CNV, and the proportion of any given chromosome susceptible to CNV varies from 6% to 19% (Supplementary Figure 7).
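Merging overlapping CNV calls into CNVRs, as described above, is in essence an interval-merging step. The following Python sketch assumes that calls are given as (chromosome, start, end) tuples with inclusive coordinates and that any overlap triggers a merge; the study's exact merging criteria (and the more stringent criteria used to separate CNV events) are not reproduced here, and the function name is hypothetical.

```python
from collections import defaultdict

def merge_into_cnvrs(cnv_calls):
    """Merge overlapping CNV calls into copy number variable regions.

    cnv_calls: iterable of (chrom, start, end) tuples pooled across samples.
    Returns a list of (chrom, start, end) CNVRs sorted by chromosome name
    and start position. Assumption: any shared base merges two calls.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnv_calls:
        by_chrom[chrom].append((start, end))

    cnvrs = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the region being built: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap before this call: close the region, start a new one.
                cnvrs.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        cnvrs.append((chrom, cur_start, cur_end))
    return cnvrs
```

Under this rule a small deletion in one individual that overlaps a large duplication in another ends up in the same CNVR, which is why a separate, stricter step is needed to count independent CNVs.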
Figure: Defining copy number variable regions (CNVRs), copy number variants (CNVs) and CNV ends
Figure: Genomic distribution of copy number variable regions
Gaps within the reference human genome assembly have an extremely high likelihood of being associated with CNVs; out of the 345 gaps in the build 35 assembly, 48% (164/345) are flanked or overlapped by CNVRs. This finding highlights the complexity in generating a reference sequence in regions of structural dynamism and emphasizes the need for ongoing characterization of these genomic regions.
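A count such as 164 of 345 gaps can be obtained by testing each assembly gap against the CNVR set. The Python sketch below is illustrative only: the text does not state how close a CNVR must lie to count as flanking a gap, so `flank_bp` is a hypothetical parameter, and coordinates are assumed to be inclusive (chromosome, start, end) tuples.

```python
def overlaps_or_flanks(gap, cnvr, flank_bp=0):
    """True if the CNVR overlaps the gap or lies within flank_bp of it."""
    g_chrom, g_start, g_end = gap
    c_chrom, c_start, c_end = cnvr
    if g_chrom != c_chrom:
        return False
    return c_start <= g_end + flank_bp and c_end >= g_start - flank_bp

def gaps_with_cnvr(gaps, cnvrs, flank_bp=0):
    """Return (hits, total, fraction) of gaps overlapped or flanked by a CNVR."""
    hits = sum(
        1 for gap in gaps
        if any(overlaps_or_flanks(gap, cnvr, flank_bp) for cnvr in cnvrs)
    )
    total = len(gaps)
    return hits, total, (hits / total if total else 0.0)
```

With the reported counts, 164 / 345 is about 0.475, which rounds to the 48% quoted above.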
Comparing the CNVRs identified on the two platforms reveals that the WGTP and 500K EA platforms largely complement one another. The 500K EA platform is better at detecting smaller CNVs (Supplementary Figure 8
), whereas the WGTP platform has more power to detect CNVs in duplicated genomic regions (Supplementary Table 13
) where 500K EA coverage is poorer 30.
Some CNVRs encompass two or more independent juxtaposed CNVs. For example, a small deletion found in one individual overlapping a much larger duplication in another individual was merged into a single CNVR, despite these representing distinct events. To delineate independent CNVs (CNV ‘events’), we applied more stringent merging criteria to separate juxtaposed CNVs, and identified 1,116 and 1,203 CNVs on the WGTP and 500K EA platforms, respectively (Supplementary Table 11
). We classified these CNVs into five types: (i) deletions, (ii) duplications, (iii) deletions and duplications at the same locus, (iv) multi-allelic loci and (v) complex loci whose precise nature was difficult to discern. Due to the inherently relative nature of these comparative data, it was impossible to determine unambiguously the ancestral state for most CNVs, and hence whether they are deletions or duplications. Here we adopted the convention of assuming that the minor allele is the derived allele 31
, thus deletions have a minor allele of lower copy number and duplications have a minor allele of higher copy number. Approximately equal numbers of deletions and duplications were identified on the WGTP platform, whereas deletions outnumbered duplications by approximately 2:1 on the 500K EA platform. In addition, 33 homozygous deletions (relative to the reference sequence) identified on the 500K EA platform were experimentally validated with locus-specific assays (Supplementary Table 14
). Most (27/33) of these have not been observed in a previous genome-wide survey of deletions 7.
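The minor-allele convention described above can be stated as a small decision rule. The Python sketch below assumes a hypothetical input, a mapping from the copy number of each allele to the number of chromosomes carrying it; the study worked from array intensity data, so this illustrates the convention only and is not the classification procedure actually used. Complex loci and loci with both deletions and duplications are not distinguished here.

```python
def classify_by_minor_allele(allele_counts):
    """Classify a locus from allele counts under the convention that the
    minor (rarer) allele is the derived one.

    allele_counts: dict {copy number of allele: chromosomes carrying it}.
    """
    observed = {cn: n for cn, n in allele_counts.items() if n > 0}
    if len(observed) < 2:
        return "not variable"
    if len(observed) > 2:
        return "multi-allelic"

    (cn_a, n_a), (cn_b, n_b) = observed.items()
    if n_a == n_b:
        # Neither allele is rarer, so the convention cannot decide.
        return "ambiguous"
    minor_cn, major_cn = (cn_a, cn_b) if n_a < n_b else (cn_b, cn_a)
    # Derived allele with fewer copies: deletion; with more copies: duplication.
    return "deletion" if minor_cn < major_cn else "duplication"
```

For example, a locus where 500 chromosomes carry one copy and 40 carry none is called a deletion, whereas 500 with one copy and 40 with two is called a duplication.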
To investigate mechanisms of CNV formation, we studied the sequence context of sites of CNV. Non-allelic homologous recombination (NAHR) can generate rearrangements as a result of recombination between highly-similar duplicated sequences 32,33
. Segmental duplications are defined as sequences in the reference genome assembly sharing >90% sequence similarity over >1 kb with another genomic location 34,35
. We found that 24% of the 1,447 CNVRs were associated with segmental duplications, a significant enrichment (p<0.05). This association results from two factors: (i) rearrangements generated by NAHR and (ii) the fact that some annotated segmental duplications are not fixed in humans but are, in fact, CNVs. This latter point highlights the essentially arbitrary nature of defining segmental duplications on the basis of a single genome sequence (albeit derived from several individuals).
The likelihood of a CNV being associated with segmental duplications depended on its length and its classification: multi-allelic CNVs, complex CNVs, and loci at which both deletions and duplications occurred were strikingly enriched for segmental duplications (Supplementary Figure 9
). This is not surprising given the role that NAHR has been shown to play in generating complex structural variation 36
, arrays of tandem duplications that vary in size 37
and reciprocal deletions and duplications 38.
The likelihood of a segmental duplication being associated with a CNV was greater for intra-chromosomal duplications than for inter-chromosomal duplications, and was highly correlated with increasing sequence similarity to its duplicated copy (Supplementary Figure 10
). NAHR is known to operate mainly on intra-chromosomal segmental duplications and to require 97-100% sequence similarity between duplicated copies 33,39.
This role for NAHR in generating CNVs in duplicated regions of the genome is supported by the enrichment of segmental duplications within intervals that likely contain the breakpoints of the CNV. We identified 88 CNVs from the 500K EA platform and 53 CNVs from the WGTP platform that contain a pair of segmental duplications, one at either end. These pairs of segmental duplications were biased towards high (>97%) sequence similarity, and were more frequently associated with the longest CNVs (Supplementary Figure 11
). In addition to segmental duplications, there are other types of sequence homologies that can promote NAHR, for example, dispersed repetitive elements, such as Alu
. We performed an exhaustive search for sequence homology of all kinds 41
and identified 121 CNVs from the 500K EA platform and 223 on the WGTP platform that contain lengths of perfect sequence identity longer than 100 bp between the two ends of the CNV.
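The search for CNVs with a pair of segmental duplications, one at either end, can be sketched as an interval test on the two breakpoint-containing intervals of each CNV. The Python below is a minimal illustration under stated assumptions: the data layout, function names and the `min_identity` parameter are hypothetical, coordinates are inclusive (chromosome, start, end) tuples, and the default threshold simply mirrors the >90% similarity used in the definition of segmental duplications.

```python
def intervals_overlap(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def flanking_sd_pairs(cnv_ends, sd_pairs, min_identity=0.90):
    """Find segmental duplication pairs with one copy at each end of a CNV.

    cnv_ends: (start_interval, end_interval), the two intervals likely to
              contain the CNV breakpoints.
    sd_pairs: iterable of (copy1, copy2, identity), with identity in [0, 1].
    """
    start_iv, end_iv = cnv_ends
    hits = []
    for copy1, copy2, identity in sd_pairs:
        if identity <= min_identity:
            continue  # below the similarity threshold
        forward = intervals_overlap(copy1, start_iv) and intervals_overlap(copy2, end_iv)
        reverse = intervals_overlap(copy2, start_iv) and intervals_overlap(copy1, end_iv)
        if forward or reverse:
            hits.append((copy1, copy2, identity))
    return hits
```

Raising `min_identity` to 0.97 restricts the result to pairs in the similarity range that NAHR is reported to require. Because both breakpoint intervals lie on the same chromosome, only intra-chromosomal pairs can match.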