Structural variation encompasses a heterogeneous mix of variants arising by different mutational mechanisms. This heterogeneity necessitates further subclassification. Structural variants are typically subdivided into those that result in a change in DNA dosage (copy number variants, CNVs) and those that do not (inversions and balanced translocations). Moreover, loci with variable copy numbers have a direction of change (deletion or duplication) and can be biallelic or multiallelic. Thus, biallelic deletion loci have a diploid copy number of 0, 1 or 2, representing the three possible genotypes, whereas biallelic duplications generally have a diploid copy number of 2, 3 or 4 (Fig. 1). Multiallelic CNVs can result from deletions and duplications at the same locus and frequently involve tandemly repeated arrays of duplicated sequences. In the case of the gene FCGR3B, multiallelic copy number variation results in a diploid copy number of 0, 1, 2, 3 or 4 (refs. 10,11), but for the Y-linked gene TSPY, the copy number in males ranges from 23 to 64 (ref. 12). The complexity of structural variation is further underlined by the existence at some loci of alleles that differ by multiple structural changes (refs. 13,14).
Figure 1 Diploid copy numbers, corresponding CNV genotypes and the underlying quantitative data from an array CGH experiment. Top, biallelic CNV; bottom, multiallelic CNV (data from ref. 11). Note that there is not a 1:1 mapping of diploid copy numbers to CNV …
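The relationship between diploid copy number and genotype can be illustrated with a minimal sketch. The function name and the allele sets below are illustrative assumptions, not values taken from the cited data; the sketch simply enumerates unordered pairs of per-allele copy numbers and groups them by their sum.

```python
from itertools import combinations_with_replacement


def genotypes_by_diploid_copy_number(allele_copy_numbers):
    """Group unordered genotypes (a, b) by their diploid copy number a + b.

    allele_copy_numbers: the per-allele copy numbers assumed to segregate
    at the locus (hypothetical input, e.g. [0, 1] for a biallelic deletion).
    """
    table = {}
    for a, b in combinations_with_replacement(sorted(allele_copy_numbers), 2):
        table.setdefault(a + b, []).append((a, b))
    return table


# Biallelic deletion: each diploid copy number maps to exactly one genotype.
print(genotypes_by_diploid_copy_number([0, 1]))
# {0: [(0, 0)], 1: [(0, 1)], 2: [(1, 1)]}

# Hypothetical multiallelic locus with 0-, 1- and 2-copy alleles:
# a diploid copy number of 2 is ambiguous between 0/2 and 1/1.
print(genotypes_by_diploid_copy_number([0, 1, 2]))
# {0: [(0, 0)], 1: [(0, 1)], 2: [(0, 2), (1, 1)], 3: [(1, 2)], 4: [(2, 2)]}
```

Under these assumptions, the biallelic case gives a one-to-one mapping, whereas the three-allele case reproduces the ambiguity at a diploid copy number of 2.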
Demonstrating heritability is the sine qua non of all genetic studies, and it is only recently that the heritability of large numbers of structural variants has been demonstrated (refs. 11,15). Observing mendelian inheritance of markers in pedigrees is the traditional method for assessing the heritability of genetic variants, but the frequent inability to attribute numbers of copies to each allele (a diploid copy number of 2 could represent either a 1/1 or a 2/0 genotype; see Fig. 1) complicates this approach. However, treating CNV data as quantitative traits (Supplementary Fig. 1 online) allows the heritability of all types of CNV to be demonstrated (ref. 15).
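One way to treat CNV data as a quantitative trait is a midparent-offspring regression on the raw signal (for example, array CGH log2 ratios) in parent-child trios. The sketch below is a generic quantitative-genetics illustration, not necessarily the procedure used in ref. 15; the function name and the trio values are hypothetical.

```python
def midparent_offspring_slope(trios):
    """Least-squares slope of child signal on midparent signal.

    trios: list of (father, mother, child) quantitative measurements at one
    CNV locus. Under standard quantitative-genetic assumptions this slope
    is an estimate of narrow-sense heritability.
    """
    midparent = [(father + mother) / 2 for father, mother, _ in trios]
    child = [c for _, _, c in trios]
    n = len(trios)
    mean_mid = sum(midparent) / n
    mean_child = sum(child) / n
    covariance = sum((x - mean_mid) * (y - mean_child)
                     for x, y in zip(midparent, child))
    variance = sum((x - mean_mid) ** 2 for x in midparent)
    if variance == 0:
        raise ValueError("midparent values show no variation at this locus")
    return covariance / variance


# Hypothetical log2 ratios for four trios (father, mother, child).
trios = [(-1.0, 0.0, -0.5), (0.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-1.0, -1.0, -1.0)]
print(midparent_offspring_slope(trios))
```

No allele-specific copy numbers are needed here, which is why an approach of this kind applies equally to biallelic and multiallelic loci.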
Perhaps the most comprehensive catalog of known structural variation is the Database of Genomic Variants (DGV; http://projects.tcag.ca/variation/), which currently contains results from 37 publications, representing a bevy of experimental and analytical approaches to detecting structural variation. Combining information from different experiments in a meaningful way is challenging: choice of technique, genome assembly and reference sample(s) all frustrate meta-analysis of existing structural variation data. At the time of writing, there were 3,966 entries in the DGV (3,889 CNVs and 77 inversions or inversion breakpoints) at 2,191 loci, covering a staggering 405 Mb (14%) of the genome. The size distribution of CNV loci in the DGV ranges from 1 kb to 3.89 Mb, with a median of 103 kb. The DGV almost certainly contains a nontrivial number of false positives, and individual variants do not come with any measure of validity. Moreover, when large-insert clones are used as microarray probes, the technology is sensitive enough to detect a CNV even if only a minority of the clone is copy number variable; as a result, the size of a CNV can be overestimated.
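The distinction between entries and loci, and the genome coverage figure, can be made concrete with a minimal sketch that merges overlapping entries into loci and sums the bases covered. The merging rule (any overlap or abutment on the same chromosome), the half-open coordinates, the function names and the example intervals are all assumptions for illustration; the DGV's actual criteria for defining a locus may differ.

```python
def merge_into_loci(entries):
    """Merge overlapping or abutting entries into non-redundant loci.

    entries: iterable of (chrom, start, end) with half-open coordinates
    and start < end (hypothetical format).
    """
    loci = []
    for chrom, start, end in sorted(entries):
        if loci and loci[-1][0] == chrom and start <= loci[-1][2]:
            # Extends the current locus on the same chromosome.
            loci[-1][2] = max(loci[-1][2], end)
        else:
            loci.append([chrom, start, end])
    return [tuple(locus) for locus in loci]


def covered_bases(loci):
    """Total number of bases spanned by a set of non-overlapping loci."""
    return sum(end - start for _, start, end in loci)


# Four hypothetical entries reported by different studies.
entries = [("chr1", 100, 200), ("chr1", 150, 300),
           ("chr1", 400, 500), ("chr2", 100, 200)]
loci = merge_into_loci(entries)
print(loci)                 # [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 100, 200)]
print(covered_bases(loci))  # 400
```

Because each merged locus takes the outermost reported boundaries, any overestimation of CNV size in the individual entries carries over directly into the coverage total.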
Figure 2 Cumulative number of RefSNP entries in dbSNP and cumulative number of variant loci in the Database of Genomic Variants, plotted as a function of time. Axes have been scaled differently to enhance visualization. RefSNP entries: left axis, blue (M = million).
Current technologies allow assessment of medium-to-large structural variation across almost all of the euchromatic human genome (ref. 16). CNVs detected thus far are not randomly distributed across the genome but are preferentially clustered near centromeres and telomeres, regions known to be enriched with segmental duplications (refs. 11,17).
Thus far, a limited number of populations have been represented in genome-wide CNV studies. Although the populations sampled by the International HapMap Project (European ancestry, Yoruba from Nigeria, Han Chinese, Japanese; ref. 4) are the most thoroughly characterized with respect to CNV (refs. 11,15,18,19), several studies have typed small samples from additional populations such as Native Americans and Pacific Islanders (refs. 17,20,21). Although the HapMap samples seem to be representative of global SNP variation (ref. 5), there will be a benefit to sampling structural variation from a broader set of populations. Careful planning and description of population sampling will greatly improve the utility of future data sets of genome-wide structural variation.
Clearly, these are the early stages of structural genomic research. Based on genome comparisons (ref. 22) and analysis of small indels (refs. 23,24) and large polymorphic deletions (ref. 18), it is evident that the length distribution of copy number variation is approximately exponential, with many small variants and few large ones. Small structural variants (1–10 kb) are the most underascertained, as they are difficult to discover with most existing platforms. Owing to the experimental difficulties of detecting balanced rearrangements, this class of variation is also largely unstudied. Cytogenetic work has estimated that a balanced translocation is formed in at least 1 of 2,000 concepti (ref. 25), and structural variation in subtelomeres is also known to be extensive (ref. 26). Thus far, most polymorphic inversions have been identified by comparison of pairs of genomes characterized in detail (refs. 27–29). As the number of genomes screened for inversions increases, we should expect to see a rapid increase in the number of known inverted sequences.
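To see why an approximately exponential length distribution implies that small variants are underascertained, consider the following rough sketch. It assumes an idealized exponential distribution with a hypothetical mean length and a hard detection threshold; neither value comes from the studies cited above, and real platforms do not have a sharp size cutoff.

```python
import math


def fraction_shorter_than(length_kb, mean_length_kb):
    """Fraction of variants shorter than length_kb under an exponential model."""
    return 1.0 - math.exp(-length_kb / mean_length_kb)


def median_of_detected(min_detectable_kb, mean_length_kb):
    """Median length among variants at least min_detectable_kb long.

    The exponential distribution is memoryless, so the detected lengths are
    the threshold plus an exponential with the same mean.
    """
    return min_detectable_kb + mean_length_kb * math.log(2)


mean_kb = 10.0       # hypothetical mean variant length
threshold_kb = 50.0  # hypothetical smallest detectable variant

# Fraction of all variants that would fall below the detection threshold.
print(fraction_shorter_than(threshold_kb, mean_kb))
# True median versus the median among detected variants.
print(median_of_detected(0.0, mean_kb), median_of_detected(threshold_kb, mean_kb))
```

In this idealized model, a platform that misses short variants both hides most of the distribution and shifts the observed median upward by the size of the threshold.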
Existing technologies used to survey genome-wide copy number variation have limited ability to characterize the breakpoints of a CNV because resolution is sacrificed for coverage; consequently, breakpoints for a given CNV typically can be mapped with a resolution of only 10–100 kb (ref. 11). Without sequencing-level resolution, it is difficult to establish whether two alleles with indistinguishable structures stem from the same or different ancestral mutation events. Resolving this ambiguity would facilitate the incorporation of structural variants into standard genetic analyses, which use the genotype as the core currency. Analysis methods for quantitative data (for example, array-based comparative genome hybridization (CGH)) typically identify CNVs as outliers against a background of invariant loci in the same genomes; however, the resultant set of CNV ‘calls’ cannot be considered a reliable proxy for genotypes. At a minority of CNVs, the quantitative data can be used to cluster individuals into discrete classes that for biallelic CNVs correspond to the three possible genotypes (Fig. 1); however, for multiallelic CNVs, which constitute a sizeable fraction of large CNVs (ref. 11), it is not possible to translate the diploid copy number into a genotype. The prospect of targeted assays for previously identified CNVs promises to dramatically increase the proportion of biallelic CNVs that can be genotyped unambiguously (ref. 30).
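The clustering of quantitative data into discrete genotype classes can be sketched with a one-dimensional k-means using three clusters. This is a simplified illustration, not the method of any cited study: the function name, the seeding of cluster centers at the minimum, median and maximum, and the example log2 ratios are all assumptions, and the sketch presumes a biallelic CNV with all three classes present and well separated.

```python
def cluster_three_classes(values, iterations=50):
    """Cluster per-individual signals into three classes (1-D k-means).

    Returns (labels, centers); labels 0, 1, 2 are ordered by increasing
    signal because the centers are seeded at the minimum, median and maximum.
    """
    ordered = sorted(values)
    centers = [ordered[0], ordered[len(ordered) // 2], ordered[-1]]
    labels = []
    for _ in range(iterations):
        # Assign each individual to the nearest cluster center.
        labels = [min(range(3), key=lambda k: abs(v - centers[k])) for v in values]
        new_centers = []
        for k in range(3):
            members = [v for v, label in zip(values, labels) if label == k]
            # Keep the previous center if a cluster is empty.
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical log2 ratios for nine individuals at a biallelic deletion.
log2_ratios = [-1.9, -2.1, -2.0, -1.0, -0.9, -1.1, 0.0, 0.1, -0.1]
labels, centers = cluster_three_classes(log2_ratios)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

For a biallelic deletion, the three labels would then be read as diploid copy numbers 0, 1 and 2. At a multiallelic locus the same procedure could still separate copy number classes, but as noted above those classes would not translate into genotypes.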
The ancestral state of a variant is of great importance in population genetics, as it establishes the direction of change; it is usually assigned on the basis of comparisons to closely related species. For structural variation, this is complicated by the fact that many sites of structural variation in the human genome are also structurally variable in the chimpanzee genome (ref. 31). However, if ancestral states could be determined for large numbers of structural variants (by analyzing their haplotypic background in humans (refs. 12,32) or by studying more outgroup species), subsequent population genetic analysis would be greatly facilitated.