Human geneticists have long sought to know the extent of genetic variation. Here, in the most comprehensive analysis to date, we present the latest estimate, which places it at greater than 1% within an individual genome. Using multiple computational and experimental approaches, this study substantially expands on the SV map initially constructed by Levy and colleagues [1]; more than 80% of the total 48,777,466 structurally variable bases were not reported in the original sequencing of the Venter genome.
Our study differs from previous studies in several ways. Our mate-pair approach makes use of multiple clone insert sizes, ranging from 2 to 37 kb, which enables us to detect a wider size range of variants than previous studies focused on paired-end mapping [15]. Furthermore, the long sequence reads used here increase alignment accuracy and enable the identification of intra-alignment gaps. Using microarrays, we are able to identify large variants that can be challenging to identify by sequencing.
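The effect of insert size on the detectable size range can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Library` class, the function names, the standard deviations, and the three-standard-deviation cutoff are all hypothetical choices made for illustration. Only the 2 kb and 37 kb insert sizes come from the text. The sketch assumes that a mate pair is called discordant when the distance between its mates on the reference deviates from the library mean by more than the cutoff, and that an insertion can only be detected if the clone insert is long enough to span it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """A clone library; the spreads used below are illustrative assumptions."""
    name: str
    mean_insert: int  # expected distance between mates, in bp
    sd_insert: int    # spread of that distance, in bp


def classify_span(mapped_span: int, lib: Library, n_sd: float = 3.0) -> str:
    """Compare the mates' distance on the reference with the library expectation."""
    delta = mapped_span - lib.mean_insert
    if abs(delta) <= n_sd * lib.sd_insert:
        return "concordant"
    # Mates farther apart on the reference than expected: the sample lacks
    # reference sequence (deletion). Closer than expected: the sample carries
    # extra sequence (insertion).
    return "deletion" if delta > 0 else "insertion"


def expected_call(sv_type: str, sv_size: int, lib: Library) -> str:
    """Predict the call for an idealized clone of exactly mean length spanning the variant."""
    if sv_type == "deletion":
        return classify_span(lib.mean_insert + sv_size, lib)
    if sv_type == "insertion":
        if sv_size >= lib.mean_insert:
            # The clone is too short to bridge the inserted sequence.
            return "not spanned"
        return classify_span(lib.mean_insert - sv_size, lib)
    raise ValueError(f"unsupported variant type: {sv_type}")


# Hypothetical libraries at the two ends of the 2 to 37 kb range.
small = Library("2kb", mean_insert=2_000, sd_insert=200)
large = Library("37kb", mean_insert=37_000, sd_insert=3_000)

for sv_type, sv_size in [("deletion", 1_500), ("insertion", 10_000)]:
    for lib in (small, large):
        print(sv_type, sv_size, lib.name, expected_call(sv_type, sv_size, lib))
```

Under these assumed parameters, the 1.5 kb deletion is discordant only in the small-insert library, because the shift falls within the noise of the large one, whereas the 10 kb insertion is visible only in the large-insert library, because a 2 kb clone cannot span it.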
Furthermore, our results highlight that each variation-discovery strategy has limitations and that no single approach can capture the entire spectrum of genetic variation, thus emphasizing the importance of applying multiple strategies in SV detection. Figure shows that the variation distribution of other personal genome sequencing studies, which relied almost exclusively on NGS technology, is substantially lower than the Venter annotation across many size ranges.
There are still some regions, such as heterochromatin (Additional file 18) and highly identical segmental duplication regions, where all of the current approaches have limited detection capabilities. To prevent false discovery, we used stringent alignment criteria and excluded alignments to multiple high-identity sequences; we will therefore likely miss variants within or flanking these sequences. Insufficient probe coverage and low intensity ratio fold-change also prevent microarrays from capturing CNV of highly repetitive sequences (for example, Alu elements). As such, we suspect that more variants remain to be discovered, but their ascertainment will require specialized experimental [18] and algorithmic [29] approaches. Further increases in read-depth can yield new variants. Indeed, the greatest relative number of SVs discovered in Venter is in the 10-kb size range (Figure), corresponding to the interval with the highest clone coverage [1] (Additional file 2). As expected, our results also show that using several libraries with different insert sizes leads to increased variation discovery.
The importance of SV to gene expression (direct and indirect) [32], protein structure [33], and chromosome stability [34] is being increasingly recognized in normal development and disease [9]. At the same time, we show that SVs are: 1, grossly under-represented in published NGS projects; 2, not always imputable by SNP-based association; 3, ubiquitous along chromosomes, impacting all known functional genomic features; and 4, often large, complex, and under negative or purifying selection [19]. Coupling these observations with conjectures that prophylactic decisions will be best informed by higher-penetrance rare alleles [10] and that common SNPs explain only a proportion of heritability [37] argues persuasively that SVs should gain more prominence in genomic medicine.