|Home | About | Journals | Submit | Contact Us | Français|
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
Unbalanced structural variants (SVs), or copy number variants (CNVs), involving large-scale deletions, duplications, and insertions form one of the least well studied classes of genetic variation. The fraction of the genome affected by SVs is comparatively larger than that accounted for by single nucleotide polymorphisms1 (SNPs), implying significant consequences of SVs on phenotypic variation. SVs have already been associated with diverse diseases, including autism2,3, schizophrenia4,5 and Crohn’s disease6,7. Furthermore, locus-specific studies suggest that diverse mechanisms may form SVs de novo, with some mechanisms involving complex rearrangements resulting in multiple chromosomal breakpoints8,9.
Initial microarray-based SV surveys focused on large gains and losses10,11,12, with recent advances in array technology widening the accessible size spectrum towards smaller SVs1,13. Microarray-based commonly mapped SVs to approximate genomic locations. However, a detailed SV characterization, including analyses of SV origin and impact, requires knowledge of precise SV sequences. Advances in sequencing technology have enabled applying sequence-based approaches for mapping SVs at fine-scale14,15,16,17,18,19,20,21. These approaches include: (i) paired-end mapping (or read pair ‘RP’ analysis) based on sequencing and analysis of abnormally mapping pairs of clone ends14,22,23,24 or high-throughput sequencing fragments15,17,18; (ii) read-depth (‘RD’) analysis, which detects SVs by analyzing the read depth-of-coverage16,21,25,26,27; (iii) split-read (‘SR’) analysis, which evaluates gapped sequence alignments for SV detection28,29; and (iv) sequence assembly (‘AS’), which enables the fine-scale discovery of SVs, including novel (non-reference) sequence insertions30,31,32. Sequence-based SV discovery approaches have thus far been applied to a limited (<20) number of genomes, leaving the fine-scale architecture of most common SVs unknown.
Sequence data generated by the 1000 Genomes Project (1000GP) provide an unprecedented opportunity to generate a comprehensive SV map. The 1000GP recently generated 4.1 Terabases of raw sequence in pilot projects targeting whole human genomes33 (Supplementary Table 1). These studies comprise a population-scale project, termed ‘low-coverage project’, in which 179 unrelated individuals were sequenced with an average coverage of 3.6X – including 59 Yoruba individuals from Nigeria (YRI), 60 individuals of European ancestry from Utah (CEU), 30 of Han ancestry from Beijing (CHB), and 30 of Japanese ancestry from Tokyo (JPT; the latter two were jointly analyzed as JPT+CHB). In addition, a high-coverage project, termed the ‘trio project’, was carried out, with individuals of a CEU and a YRI parent-offspring trio sequenced to 42X coverage on average.
We report here the results of analyses undertaken by the Structural Variation Analysis Group of the 1000GP. The group’s objectives were to discover, assemble, genotype, and validate SVs of 50 bp and larger in size, and to assess and compare different sequence-based SV detection approaches. The focus of the group was initially on deletions, a variant class often associated with disease9, for which rich control datasets and diverse ascertainment approaches exist1,13,22,28. Less focus was placed on insertions and duplications34 and none on balanced SV forms (such as inversions). Specifically, we applied nineteen methods to generate an SV discovery set. We further generated reference genotypes for most deletions, assessed the SVs’ functional impact, and stratified SV formation mechanism with respect to variant size and genomic context.
We incorporated the SV discovery methods into a pipeline (Fig. 1AB), with the goal of ascertaining different SV types and assessing each method for its ability to discover SVs. The methods detected SVs by analyzing RD, RP, SR, and AS features, or by combining RP and RD features (abbreviated as ‘PD’). Altogether we generated thirty-six SV callsets by applying the methods on trio and low-coverage data, and by identifying SVs as genomic differences relative to a human reference, corresponding to the reference genome, or to a set of individuals (i.e. population reference; Supplementary Table 2). We initially identified SVs as deletions, tandem duplications, novel sequence insertions, and mobile element insertions (MEIs) relative to the human reference. Subsequent comparative analyses involving primate genomes enabled us to classify SVs as deletions, duplications, or insertions relative to inferred ancestral genomic loci, reflecting mechanisms of SV formation (see below). DNA reads analyzed by SV discovery methods were initially mapped to the human reference genome using a variety of alignment algorithms. Most of these algorithms mapped each read to a single genomic position, although one algorithm (mrFAST16) also considered alternative mapping positions for reads aligning onto repetitive regions (see Supplementary Tables 2-4 for method-specific parameters and full SV callsets). We filtered each callset by excluding SVs <50bp, which are reported elsewhere33. Many SVs exhibited support from distinct SV discovery methods, as exemplified by a common deletion, previously associated with body-mass index35 (BMI), that we identified with RP, RD, and SR methods (Fig. 1C). Nonetheless, we observed notable differences between methods (Fig. 2ABC) in terms of genomic regions ascertained (Supplementary Fig. 1), accessible SV size-range (Fig. 2A), and breakpoint precision (Fig.2C, Supplementary Fig. 2).
To estimate callset specificity, we carried out extensive validations (Methods), including PCRs for over 3,000 candidate loci, and microarray data analyses for 50,000 candidate loci (Supplementary Tables 3, 4; Supplementary Fig. 3). We combined PCR and array-based analysis results to estimate false discovery rates (FDRs), and found that eight callsets (three deletion, four insertion, and one tandem duplication callset) met the pre-specified specificity threshold33 (FDR≤10%), whereas the other callsets yielded lower specificity (FDRs of 13%-89%).
We further assessed the sensitivity of deletion discovery methods by collating data from four earlier surveys1,13,22,28 into a gold standard (Methods, Supplementary Tables 5, 6, and Supplementary Fig. 4A), and specifically assessing the detection sensitivity for an individual sequenced at high-coverage (NA12878) as well as for an individual sequenced at low-coverage (NA12156). Unsurprisingly, given the typical trade-off between sensitivity and specificity, in the trios the highest sensitivities were achieved by RD and RP methods with FDR>10% (Fig. 2B). By comparison, in the low-coverage data, the individual method with the greatest accuracy (FDR=3.7%) was the second most sensitive based on our gold standard (Fig. 2B), and the most sensitive when expanding the gold standard to a larger set of individuals (Supplementary Fig. 4B). This method, Genome STRiP (to be described elsewhere36), integrated both RP and RD features (PD), implying that considering different evidence types can improve SV discovery.
To construct our SV discovery set (“release set”), we joined calls from different discovery methods corresponding to the same SV with a merging approach that was aware of each callset’s precision in SV breakpoint detection (Supplementary Fig. 5 and Methods). Most SVs in the release set (61%) were contributed by individual methods meeting the pre-defined specificity threshold (FDR≤10%). The remaining 39% of calls were contributed by lower specificity methods following experimental validation. Altogether, the release set comprised 22,025 deletions, 501 tandem duplications, 5,371 MEIs, and 128 non-reference insertions (Table 1, Supplementary Table 7). With our gold standard we estimated an overall sensitivity of deletion discovery of 82% in the trios, and 69% in low-coverage sequence (Fig. 2B) using a 1 bp overlap criterion. When instead applying a stringent 50% reciprocal overlap criterion for sensitivity assessment (which required SV sizes inferred on different experimental platforms to be in close agreement) our sensitivity estimates decreased by 12% and 18%, respectively, in trio and low-coverage sequence (Supplementary Table 8). We further examined an alternative approach that involved the pairwise integration of deletion discovery methods, and tested its ability to discover SVs without relying on the inclusion of lower specificity calls following experimental validation (“algorithm-centric set”; Fig. 1B). While this alternative approach resulted in an increased number (by ~13%) of high-specificity (FDR<10%) calls compared to the release set (Supplementary Text), it overall resulted in fewer SV calls owing to its decreased sensitivity at the lower (<200bp) SV size range. In the following analyses we thus focused on the release set.
We next assessed the extent and impact of our SV discovery (release) set. The median SV size was 729 bp (mean=8 kb), approximately four times smaller than in a recent tiling CGH based study1, reflecting the high resolution of DNA sequence based SV discovery. We also compared our set to a recent survey of SVs in an individual genome37 based on capillary sequencing and array-based analyses24, and observed a similar size distribution for deletions, but differences in the size distributions of other SV classes, reflecting underlying differences in SV ascertainment (Supplementary Fig. 6). By comparing our SVs to databases of structural variation and to additional personal genome datasets, we classified 15,556 SVs in our set as novel, with an enrichment of low frequency SVs and small SVs amongst the novel variants (Methods and Supplementary Text).
A major advantage of sequence-based SV discovery is the nucleotide resolution mapping of SVs. We initially mapped the breakpoints of 7,066 deletions and 3,299 MEIs using SR and AS features. Using the TIGRA-targeted assembly approach38 we further identified the breakpoints of an additional 4,188 deletions and 160 tandem duplications, initially discovered by RD, RP, and PD methods (Methods, Supplementary Table 2). Altogether, we mapped ~15,000 SVs at nucleotide resolution, 48% of which were novel. Few deletion loci (4.4%) displayed different SV breakpoints in different samples, which is explainable by rare TIGRA misassemblies, or alternatively, by recurrently formed, multi-allelic SVs (Supplementary Text). TIGRA further enabled us to validate an additional 7,359 SVs discovered with RP or RD features by identifying the SVs’ breakpoints (Methods), and to evaluate the mapping precision of SV discovery methods (Fig. 2C, Supplementary Figure 2).
We further assessed the putative functional impact of SVs in our set by relating them to genomic annotation. Seventeen hundred SVs affected coding sequences, resulting in full gene overlaps or exon disruptions (Table 2), many of which led to out-of-frame exons (Supplementary Table 9). We related gene disruptions to gene functions, and observed significant enrichments for several functional categories including cell defense and sensory perception (Supplementary Table 10). High levels of structural variation, including copy-number variation, were previously described for both processes15,22,39. These SVs might be maintained in the population by selection for the purpose of functional redundancy. While most SVs intersecting with genes were deletions, several validated tandem duplications and MEIs also intersected with coding sequences (Table 2).
We next sought to generate genotypes for deletions discovered in the 1000GP data, both to facilitate population genetics analyses and to make our SV set amenable to association studies in the form of a reference genotype set. In this regard, the Genome STRiP36 genotyping method was developed, a method combining information from RD, RP, SR and haplotype features of population-scale sequence data for genotyping (Methods, Supplementary Text). Using this approach we generated genotypes for 13,826 autosomal deletions in 156 individuals. The genotypes displayed 99.1% concordance with CGH array1 based genotypes (available for 1,970 of the deletions), suggesting high genotyping accuracy.
Fig. 3 presents allele frequency analyses based on these genotypes. As expected, common polymorphisms (minor allele frequency (MAF) >5%) were generally shared across populations, while rare alleles were frequently observed in only one population (Figs. 3ABC). We observed several candidates for monomorphic deletions (i.e., genomic segments putatively deleted in all individuals), explainable by rare insertions present in the reference genome or by remaining genotyping inaccuracies (Supplementary Text).
We next assessed the allele frequencies of gene deletions (Fig. 3D). Similar to a recent array-based study1, we observed a depletion of high frequency alleles among deletions intersecting with protein-coding sequence compared to other deletions (P=1.1×10−11; KS test), consistent with purifying selection keeping most gene deletions at low frequency. Nonetheless, several coding sequence deletions were observed with high allele frequency (>80%). Most of these occurred in regions annotated as segmental duplications, consistent with lessened evolutionary constraintin functionally redundant gene categories22. Intriguingly, common gene deletions also affected many unique genes with no obvious paralogs. We further analyzed the abundance of gene deletions in different populations and observed highly differentiated loci, albeit with no statistically significant relationship between differentiation and particular categories of gene overlap, i.e., intronic vs. exonic (Supplementary Text).
By comparing deletion genotypes with genotypes of nearby SNPs, we found, consistent with earlier studies1,13,40, that deletions in genomic regions accessible to short read sequencing display extensive linkage disequilibrium (LD) with SNPs. 81% of common deletions had one or more SNPs with which they are strongly correlated (r2>0.8; Supplementary Fig. 7). This suggests that many deletions mapped in our study will be identifiable through tagging SNPs in future studies (Supplementary Text). On the other hand, a fifth of the genotyped deletions were not tagged by HapMap SNPs (a figure similar to the fraction of SNPs that are not tagged by HapMap SNPs41), implying that these SVs should be genotyped directly in association studies. Furthermore, the LD properties of complex SVs (e.g., multiallelic SV) have not yet been fully ascertained as methods for genotyping such SVs with similar accuracy still being developed.
Nucleotide resolution breakpoint information enables inference of SV formation mechanisms15,22. Recent studies broadly distinguished between several germline rearrangement classes, some of which may comprise more than one SV formation mechanism15,22,42,43: non-allelic homologous recombination (NAHR), associated with long sequence similarity stretches around the breakpoints; rearrangements in the absence of extended sequence similarity (abbreviated as “non-homologous” or NH), associated with DNA repair by non-homologous end-joining (NHEJ) or with microhomology-mediated break-induced replication (MMBIR); the shrinking or expansion of variable number of tandem repeats (VNTRs), frequently involving simple sequences, by slippage; and MEIs. We distinguished among the classes NAHR, NH, VNTR, and MEI by examining the breakpoint junction sequence of SVs initially discovered as deletions or tandem duplications relative to a human reference.
We first compared the SVs to orthologous primate genomic regions to distinguish deletions from insertions/duplications with respect reconstructed ancestral loci using the BreakSeq classification approach43. This analysis showed that of the 11,254 nucleotide-resolution SVs discovered as deletions relative to a human reference, 21% actually represented insertion and 2% represented tandem duplications relative to the putative ancestral genome. Of the remaining SVs, 60% were classified as deletions relative to ancestral sequence, whereas the ancestral state of 17% was undetermined. By comparison, out of 160 nucleotide-resolution SVs identified as tandem duplications relative to the reference genome, 91.6% were classified as duplications relative to the ancestral genome, whereas the ancestral state of 8.4% remained undetermined (Supplementary Text). Our breakpoint analysis revealed that 70.8% of the deletions and 89.6% of the insertions exhibited breakpoint microhomology/homology ranging from 2-376 bp in size, with distribution modes of 2 bp (attributable to NH) and 15 bp (attributable to MEI), respectively (Fig. 4A, Supplementary Text). As expected42, a small portion of the deletions (16.1%) displayed non-template inserted sequences at their breakpoint junctions. By comparison, the tandem duplications showed extensive stretches displaying ≤95% sequence identity at the breakpoint linearly correlating in lenght with SV size (Fig. 4A). In addition, most tandem duplications displayed 2-17 bp of microhomology at the breakpoint junctions (Supplementary Text).
We subsequently applied BreakSeq43 to infer formation mechanisms for all SVs classified with regard to ancestral state. Using BreakSeq, we inferred NH as the dominating deletion mechanism, and MEI as the dominating insertion mechanism (Fig. 4BC, Supplementary Table 11). Furthermore, an abundance of microhomology at tandem duplication breakpoints suggested frequent formation of this SV class by a rearrangement process acting in the absence of homology (NH). When relating SV formation to the variant size spectrum, we observed marked insertion peaks for MEIs at 300 bp, corresponding to Alu elements, and at 6 kb, corresponding to L1/LINEs (Fig. 4C). By comparison, NH and NAHR based mechanisms occurred across a wide size-range, whereas VNTR expansion/shrinkage, consistent with earlier findings1, led to relatively small SV sizes (Figs. 4C,D).
Furthermore, when displaying the genomic distribution of SVs (Fig. 5A), we observed a notable clustering of SVs into ’SV hotspots’. We analyzed this clustering in detail by examining the distribution of non-overlapping, adjacent SVs, and observed a marked clustering of SVs formed by NAHR, VNTR, and NH, respectively, a signal extending to hundreds of kilobases (Fig.5B). The clustering was influenced by an abundance of VNTR near the centromeres43 and NAHR near the telomeres (Fig.5A). A significant enrichment of NAHR near recombination hotspots (P=1.3e-15) and segmental duplications (P=3.1e-17) further contributed to the clustering (Supplementary Table 13).
To further explore this clustering we devised a segmentation approach for predicting SV hotspots (Methods), which yielded a map of 51 putative SV hotspots (Supplementary Table 14). 80% of the hotspots mainly comprised SVs originating from a single formation mechanism (Fig. 5C). Most of these corresponded to NAHR hotspots, although hotspots dominated by NH and VNTR also were evident. These observations suggest that SV formation is frequently associated with the locus-specific propensity for genomic rearrangement.
By generating an SV set of unprecedented size along with breakpoint assemblies and reference genotypes, we demonstrate the suitability of population-scale sequencing for SV analysis. Nucleotide resolution data allow the construction of reference datasets and make SVs readily assessable across different experimental platforms using genotyping approaches. Our fine-scale map enabled us to examine the functional impact of SVs, as exemplified by our analysis of gene disruption variants, which will be of value for genome and exome sequencing studies.
Our map further enabled us to examine size spectra of SV formation mechanisms and led us to identify genomic SV hotspots that are commonly dominated by a single formation mechanism. Recurrent rearrangements, implicated in genomic disorders, are hypothesized to be associated with local genome architecture44, e.g., with segmental duplications that facilitate NAHR. Also, DNA rearrangement in the absence of homology, i.e., MMBIR, has been implicated in recurrent SV formation8,45. In this regard, we noticed that out of the hotspots we report, six fall into critical regions of known genetic disorders associated with recurrent de novo deletions, including Miller-Dieker syndrome and Leri-Weill dyschondrosteosis (Supplementary Table 14). Irrespective of potential disease relevance, or inferred mechanism of formation, our analysis revealed a map of SV hotspots that may constitute local centers of de novo SV formation, consistent with the concept that local genome architecture contributes to genomic instability44.
Our study focused on characterizing deletions, which are often associated with disease9. Facilitated by ancestral analyses of SV loci, we also characterized insertions and tandem duplications, albeit in less detail than deletions. Companion papers with more detailed analyses of MEIs, and copy-number variation within segmental duplications are published elsewhere34,46. Of note, most SV discovery methods depend on mapping reads onto their genomic locus of origin, i.e., the ‘accessible’ fraction of the genome, a fraction lessened in segmental duplications that are of high interest to SV analysis. Nonetheless, owing to the abilities of RP and RD methods in detecting SVs in these regions and in interpreting reads with multiple mapping positions, the ‘accessible’ fraction of the genome is higher for SVs than for SNPs16. In the future, sequencing technologies generating longer DNA reads will increase the accessible genome, and will enable the assessment of SVs embedded in long repeat structures, such as balanced inversions.
Our SV resource will enable the discovery, genotyping, and imputation of SVs in larger cohorts. Numerous genomes will be sequenced in the coming months to facilitate disease association studies. Systematic characterization of SVs in these genomes will benefit from the concepts and datasets presented here.
Sequence data for 179 unrelated individuals and six individuals from parent-offspring trios were obtained as part of the 1000GP. These data were generated with Illumina/Solexa, Roche/454, and Life Technologies/SOLiD sequencing technology platforms.
The SV discovery methods we applied comprised six RP, four RD, three SR, four AS, and two PD based methods. TIGRA38 was used for targeted breakpoint assembly.
We validated SV calls by PCR, array CGH and SNP microarrays, targeted assembly, and custom microarray-based sequence capture. PCR was performed in various different laboratories33, CGH analysis was performed based on tiling array data provided by the Genome Structural Variation Consortium (ArrayExpress: E-MTAB-40), and SNP array analysis based on data obtained from the International HapMap Consortium (http://hapmap.ncbi.nlm.nih.gov).
Genome STRiP36 was used for deletion genotyping in low coverage sequence data. Initial genotype likelihoods were derived with a Bayesian model and imputation into a SNP genotype reference panel from the HapMap41 (Hapmap3r2) was achieved with Beagle (v3.1; http://faculty.washington.edu/browning/beagle/beagle.html).
SV breakpoints mapped at nucleotide resolution were analyzed with BreakSeq43 to classify SVs relative to putative ancestral loci and to infer SV formation mechanisms. SV hotspots were mapped with custom Perl and R scripts.
We would like to acknowledge Claire Hardy, Richard Smith, Anniek De Witte, and Shane Giles for their assistance with validation. M.A.B’s group was supported by grants from the National Institutes of Health (RO1 GM59290) and G.T.M’s group by grants R01 HG004719 and RC2 HG005552, also from the NIH. J.O.K.’s group was supported by an Emmy Noether Fellowship of the German Research Foundation (Deutsche Forschungsgemeinschaft). J.W.’s group was supported by the National Basic Research Program of China (973 program no. 2011CB809200), the National Natural Science Foundation of China (30725008; 30890032; 30811130531; 30221004), the Chinese 863 program (2006AA02Z177; 2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen Municipal Government of China (grants JC200903190767A; JC200903190772A; ZYC200903240076A; CXB200903110066A; ZYC200903240077A; ZYC200903240076A and ZYC200903240080A), and the Ole RØmer grant from the Danish Natural Science Research Council. C.L.’s group was supported by grants from the National Institutes of Health: P41 HG004221, RO1 GM081533, and UO1 HG005209 and X.S. was supported by a T32 fellowship award from the NIH. We thank the Genome Structural Variation Consortium (http://www.sanger.ac.uk/humgen/cnv/42mio/) and the International HapMap Consortium for making available microarray data. The authors acknowledge the individuals participating in the 1000 Genomes Project by providing samples, including The Yoruba people of Ibadan, Nigeria, the community at Beijing Normal University, the people of Tokyo, Japan, and the people of the Utah CEPH community. Furthermore, we thank Richard Durbin and Lars Steinmetz for comments on the manuscript.
Competing interests statement: The authors declare competing financial interests. H.E.P. and Y.F. are employees of Life Technologies, the manufactures of the SOLiD sequencing platform. R.K.C. is an employee of Illumina Cambridge Ltd., the manufacturer of the Illumina sequencing platform.
Data retrieval: The data sets described here can be obtained from the 1000 Genomes Project website at www.1000genomes.org (July 2010 Data Release). Individual SV discovery methods can be obtained from sources mentioned in Supplementary Table 1, or upon request from the authors. Abbreviations used in this paper are summarized in the Supplementary Text.