Population genetics is central to our understanding of human variation, and by linking medical and evolutionary themes, it enables us to understand the origins and impacts of our genomic differences. Despite current limitations in our knowledge of the locations, sizes and mutational origins of structural variants, our characterization of their population genetics is developing apace, bringing new insights into recent human adaptation, genome biology and disease. We summarize recent dramatic advances, describe the diverse mutational origins of chromosomal rearrangements and argue that their complexity necessitates a re-evaluation of existing population genetic methods.
Although it has long been appreciated that the human genome contains a size continuum of genomic variation ranging from single-nucleotide changes to large (>3 Mb), microscopically visible karyotypic alterations, only recently has the abundance of structural variation between these two size extremes been recognized. Structural variation has been defined as genomic alteration involving segments of DNA longer than 1 kb (ref. 1). These segments can be deleted, duplicated, inserted, inverted in orientation or translocated.
Genomic variants of all sizes and types can contribute to genetic disease, and all are potential substrates for natural selection resulting in phenotypic differences between individuals, populations and species. Investigating the medical and evolutionary impact of structural variation requires that we understand the distribution of such variation within a species and the factors influencing that variation: in other words, the population genetics of structural variation.
The general factors influencing the distribution of variation within a species are common to all classes of variant and include mutation, selection, genetic drift, recombination, migration and population demography2. Although mutational mechanisms are sufficiently common across species such that structural variation is likely to be a general feature of all genomes (to a greater or lesser extent), here we confine ourselves to variation within the human genome.
Each genetic variant has its own specific evolutionary history, but it is through the analysis of many variants that the general properties of a class of variation can be elucidated. Population genetics is concerned with both variant-specific histories and the general properties of variation, both of which are pertinent to medical and evolutionary issues. Variants that seem to be population genetic outliers relative to a background of ‘normal’ variation are often enriched for medically relevant mutations. For example, the frequency at which an alpha-globin gene is deleted within a population varies between geographical locations to an unusually high degree. This is because the deletion confers both resistance to malarial infection and susceptibility to mild thalassemia: thus, it has increased in frequency in regions in which malaria is endemic3 but remains at low frequency in the absence of malaria.
Over the past two decades, SNPs4,5, microsatellites6 and minisatellites7 have been characterized extensively in different human populations, and as a result of population genetic analyses, we have learned much about the recent origin, dispersals and demography of our species and the different mutational and recombinational8 processes generating and shuffling such variation. Moreover, we can use this information to begin to identify variants conferring risk to common diseases4. It has also been possible to identify specific variants that have conferred a selective advantage to our ancestors9.
Population-genetic studies of SNPs, microsatellites and minisatellites have been a two-step process owing to economic constraints. They involve a discovery phase in which variants are identified in a limited set of individuals, followed by a targeted genotyping phase in which these variants are genotyped in diverse populations. This strategy also holds true for structural variation and introduces significant complications and biases for population genetic analyses.
Most structural variants have been discovered only in the past two years, and as a result, the population genetics of structural variation is very much in its infancy. Having outlined the importance of a population genetic perspective above, we now explore what we presently know about the distribution of structural variation and look toward the interesting questions that we can begin to address.
Structural variation encapsulates a heterogeneous mix of variants arising by different mutational mechanisms. This heterogeneity necessitates further subclassification. Structural variants are typically subdivided into those that result in a change in DNA dosage (copy number variants (CNVs)) and those that do not (inversions and balanced translocations). Moreover, loci with variable copy numbers have a direction of change (deletion or duplication) and can be biallelic or multiallelic. Thus, biallelic deletion loci have a diploid copy number of 0, 1 or 2, representing the three possible genotypes, whereas biallelic duplications generally have a diploid copy number of 2, 3 or 4 (Fig. 1). Multiallelic CNVs can result from deletions and duplications at the same locus and frequently involve tandemly repeated arrays of duplicated sequences. In the case of the gene FCGR3B, multiallelic copy number variation results in a diploid copy number of 0, 1, 2, 3 or 4 (refs. 10,11), but for the Y-linked gene TSPY, the copy number in males ranges from 23 to 64 (ref. 12). The complexity of structural variation is further underlined by the existence at some loci of alleles that differ by multiple structural changes13,14.
Demonstrating heritability is the sine qua non of all genetic studies, and it is only recently that the heritability of large numbers of structural variants has been demonstrated11,15. Observing mendelian inheritance of markers in pedigrees is the traditional method for assessing the heritability of genetic variants, but for CNVs it is frequently impossible to attribute a number of copies to each allele (a diploid copy number of 2 could represent either a 1/1 or 2/0 genotype; see Fig. 1), which complicates pedigree analysis. Treating CNV data as quantitative traits (Supplementary Fig. 1 online) circumvents this problem and allows the heritability of all types of CNV to be demonstrated15.
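The genotype ambiguity described above can be made concrete with a short sketch. The function name `possible_genotypes` and the parameter `max_copies_per_allele` are illustrative assumptions, not part of any published method; the code simply enumerates the unordered allele pairs that are consistent with an observed diploid copy number.

```python
from itertools import combinations_with_replacement


def possible_genotypes(diploid_copy_number, max_copies_per_allele):
    """List unordered allele pairs (a, b) whose copy numbers sum to the
    observed diploid copy number.

    max_copies_per_allele is an assumed upper bound on the number of
    copies a single allele can carry at this locus.
    """
    alleles = range(max_copies_per_allele + 1)
    return [
        (a, b)
        for a, b in combinations_with_replacement(alleles, 2)
        if a + b == diploid_copy_number
    ]


# Biallelic deletion (alleles carry 0 or 1 copy): copy number 2 is unambiguous.
print(possible_genotypes(2, 1))
# Locus where an allele may carry 0, 1 or 2 copies: copy number 2 is ambiguous.
print(possible_genotypes(2, 2))
```

Under these assumptions, a diploid copy number of 2 maps to a single genotype at a biallelic deletion locus but to two candidate genotypes (2/0 and 1/1) once duplicated alleles are allowed, which is why pedigree-based checks of mendelian inheritance become difficult.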
Perhaps the most comprehensive catalog of known structural variation is the Database of Genomic Variants (DGV; http://projects.tcag.ca/variation/), which currently contains results from 37 publications, representing a bevy of experimental and analytical approaches to detecting structural variation. Combining information from different experiments in a meaningful way is challenging: choice of technique, genome assembly and reference sample(s) all frustrate meta-analysis of existing structural variation data. At the time of writing, there were 3,966 entries in the DGV (3,889 CNVs and 77 inversions or inversion breakpoints; see Fig. 2) at 2,191 loci, covering a staggering 405 Mb (14%) of the genome. The size distribution of CNV loci in the DGV ranges from 1 kb to 3.89 Mb, with a median of 103 kb. Almost certainly there are a nontrivial number of false positives in the DGV, and individual variants do not come with any measure of validity. Moreover, the sensitivity of the technology is such that when using large-insert clones as microarray probes, a CNV can be detected even if only a minority of the clone is copy number variable, and as a result, the size of a CNV can be overestimated.
Current technologies allow assessment of medium-to-large structural variation across almost all of the euchromatic human genome16. CNVs detected thus far are not randomly distributed across the genome but are preferentially clustered near centromeres and telomeres, regions known to be enriched with segmental duplications11,17.
Thus far, a limited number of populations have been represented in genome-wide CNV studies. Although the populations sampled by the International HapMap Project4 (European ancestry, Yoruba from Nigeria, Han Chinese, Japanese) are the most thoroughly characterized with respect to CNV11,15,18,19, several studies have typed small samples from additional populations such as Native Americans and Pacific Islanders17,20,21. Although the HapMap samples seem to be representative of global SNP variation5, there will be a benefit to sampling structural variation from a broader set of populations. Careful planning and description of population sampling will greatly improve the utility of future data sets of genome-wide structural variation.
Clearly, these are the early stages of structural genomic research (Fig. 2). Based on genome comparisons22 and analysis of small indels23,24 and large polymorphic deletions18, it is evident that the length distribution of copy number variation is approximately exponential, with many small variants and few large ones. Small structural variants (1–10 kb) are the most underascertained, as they are difficult to discover with most existing platforms. Owing to the experimental difficulties of detecting balanced rearrangements, this class of variation is also largely unstudied. Cytogenetic work has estimated that a balanced translocation is formed in at least 1 in 2,000 concepti (ref. 25), and structural variation in subtelomeres is also known to be extensive26. Thus far, most polymorphic inversions have been identified by comparison of pairs of genomes characterized in detail27–29. As the number of genomes screened for inversions increases, we should expect to see a rapid increase in the number of known inverted sequences.
Existing technologies used to survey genome-wide copy number variation have limited ability to characterize the breakpoints of a CNV because resolution is sacrificed for coverage; consequently, breakpoints for a given CNV typically can be mapped with a resolution of only 10–100 kb (ref. 11). Without sequencing-level resolution, it is difficult to establish whether two alleles with indistinguishable structures stem from the same or different ancestral mutation events. Resolving this ambiguity facilitates the incorporation of structural variants into standard genetic analyses, which use the genotype as the core currency. Analysis methods for quantitative data (for example, array-based comparative genome hybridization (CGH)) typically identify CNVs as outliers against a background of invariant loci in the same genomes; however, the resultant set of CNV ‘calls’ cannot be considered as a reliable proxy for genotypes. At a minority of CNVs, the quantitative data can be used to cluster individuals into discrete classes that for biallelic CNVs correspond to the three possible genotypes (Fig. 1); however, for multiallelic CNVs, which constitute a sizeable fraction of large CNVs11, it is not possible to translate the diploid copy number into a genotype. The prospect of targeted assays for previously identified CNVs promises to dramatically increase the proportion of biallelic CNVs that can be genotyped unambiguously30.
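As a rough illustration of clustering quantitative CNV data into discrete classes, the sketch below runs a tiny one-dimensional k-means on hypothetical log2 intensity ratios. The function `cluster_1d`, the evenly spaced initial centers and the example values are all assumptions made for illustration; published genotyping pipelines use more careful statistical models.

```python
def cluster_1d(values, k=3, n_iter=50):
    """Minimal 1-D k-means (k >= 2).

    Returns (centers, labels). Centers start evenly spaced across the
    observed range, so label 0 is the lowest-intensity class.
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assign each measurement to its nearest center.
        labels = [
            min(range(k), key=lambda j: abs(v - centers[j])) for v in values
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(
                sum(members) / len(members) if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical log2 ratios at a biallelic deletion, relative to a
# two-copy reference sample.
ratios = [-3.1, -2.9, -1.05, -0.95, -1.0, 0.02, -0.03, 0.0]
centers, labels = cluster_1d(ratios)
print(labels)  # for a deletion locus, labels are interpreted as copy number
```

When the three clusters are well separated, as in this toy input, the cluster index can be read as the diploid copy number (0, 1 or 2). With noisier data or multiallelic loci the clusters overlap, and no such one-to-one mapping to genotypes is available.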
The ancestral state of a variant is of great importance in population genetics, as it establishes the direction of change and is usually assigned on the basis of comparisons to closely related species. For structural variation, this is complicated by the fact that many sites of structural variation in the human genome are also structurally variable in the chimpanzee genome31; however, if ancestral states could be determined for large numbers of structural variants (by analyzing their haplotypic background in humans12,32 or by studying more outgroup species), subsequent population genetic analysis would be greatly facilitated.
Population geneticists recognize that the ascertainment scheme of variants in a discovery phase strongly influences the inferences drawn from data gathered during a subsequent targeted population-screening phase33. For example, to minimize the effort wasted on genotyping monomorphic markers, the ascertainment of SNPs included in Phase I of the HapMap was strongly biased toward SNPs observed more than once in a small discovery panel4. This strongly skews the site frequency spectrum toward common variants, and as a result, it biases estimates of linkage disequilibrium (LD) and many other population genetic statistics34. Similarly, any such two-phase study of structural variation will need to be corrected for ascertainment-induced biases, but documenting the ascertainment in detail facilitates these corrections.
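The effect of such an ascertainment rule on the frequency spectrum can be sketched numerically. The function below assumes a simple binomial sampling model in which a variant with population frequency `p` is retained only if it is seen at least twice among `n` discovery chromosomes; the function name and the model are illustrative assumptions rather than the actual HapMap selection procedure.

```python
def prob_seen_at_least_twice(p, n):
    """Probability that an allele at frequency p appears two or more
    times in a discovery panel of n chromosomes (binomial sampling)."""
    p_zero = (1.0 - p) ** n
    p_one = n * p * (1.0 - p) ** (n - 1)
    return 1.0 - p_zero - p_one


# Retention probability for rare versus common variants in a small panel.
for freq in (0.01, 0.05, 0.2, 0.5):
    print(freq, round(prob_seen_at_least_twice(freq, 8), 4))
```

Under this model, rare variants are retained far less often than common ones, so the ascertained set is enriched for common alleles. Correcting downstream statistics amounts to reweighting by the inverse of this retention probability, which is possible only if the discovery scheme is documented.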
Currently ascertained structural variants (CNVs in particular) have additional biases. First, only the largest variants have been discovered thus far; second, deletions are typically easier to detect than duplications; and third, there are biases in genomic location owing to incomplete genomic coverage in many surveys. Many of these biases differ markedly between different surveys for structural variation. Therefore, making general inferences about structural variation from current data is fraught with complications. In particular, the size of a CNV is highly correlated with many other features of the CNV, so the present skew toward larger variants could result in misleading inferences if they are considered representative of all CNVs. To give one example, longer CNVs are much more likely to be associated with segmental duplications than shorter CNVs11,18,27, so the role of segmental duplications in generating all CNVs may be overestimated from the known CNVs.
The dependence on a reference genome assembly for data analysis (for example, fosmid paired-end analysis) or experimental design (for example, array CGH) also introduces biases. For instance, polymorphic sequences that are deleted in the reference genome assembly will not be detected by current array-based methodologies. Although the reference genome assembly is derived from many individuals, these contributions are not representative of global diversity. As approximately half the remaining gaps in the current genome assembly are associated with CNVs11, the continued refinement of the genome assembly should yield improved understanding of structural variation.
The mutational processes that lead to structural variation are diverse and, perhaps unsurprisingly given the low-resolution mapping of most structural variation breakpoints, poorly characterized. Many studies investigating recurrent rearrangements that cause ‘genomic disorders’ have identified breakpoints embedded within highly similar duplicated sequences35 (including both dispersed repetitive elements (for example, Alu sequences) and segmental duplications). This has led to an appreciation of the role of meiotic nonallelic homologous recombination (NAHR) in the genesis of many large rearrangements. NAHR between direct repeats causes deletions and duplications, NAHR between inverted repeats produces inversions and NAHR between repeats on different chromosomes leads to translocations. Moreover, these NAHR events can occur at rates of up to 10^-4 per generation36; microsatellites and SNPs typically have mutation rates of ~10^-3 and ~10^-8 per generation, respectively. Segmental duplications are also enriched around CNVs11 and inversions27, thus implicating NAHR in the genesis of some structural variants.
NAHR is not the only mechanism generating structural variation; indeed, even for the largest CNVs, NAHR can account for only a minority of mutational events. Moreover, the smaller the CNV, the less likely that NAHR is involved11,18,27. Non-homology-based mutation mechanisms must be responsible for the majority of structural variants. Nonhomologous end joining (NHEJ) is an alternative process by which DNA double-strand breaks are repaired, and it is likely that it has a substantial role in generating structural variation. In contrast to NAHR, NHEJ events are rarely recurrent, which suggests that they occur at a much lower rate at a given locus, probably <10^-7 per generation. It has been suggested that the propensity of a DNA sequence to adopt non-B conformations increases the likelihood of DNA double-strand breaks37 and, hence, structural variation; however, only in the case of translocations involving palindromic AT-rich repeats on chromosome 22 has this fragility been demonstrated to result in a higher rate of rearrangement at a specific locus38.
Comparative genomic analysis39,40 and, to a lesser extent, diversity within species41 have indicated that some duplications are transpositional in nature (in other words, the additional copy is inserted in a distant genomic location); this is especially prevalent in subtelomeric and pericentromeric regions of the genome40. The mechanism(s) of duplicative transposition are not well understood and deserve detailed characterization.
One special class of mutational mechanisms generating smaller structural variants is the random integration of cellular mRNA transcripts by the action of the LINE-1 reverse transcriptase in a process known as retrotransposition. Many of the resultant processed pseudogenes can be transcribed42, so although this mechanism may not account for a high proportion of all structural variants, it is likely to have a disproportionately large functional impact.
These differences in the mutational mechanisms generating structural variation have important implications for population genetic models of structural variation. Mutational models for SNPs are not appropriate for microsatellites (and vice versa), and the same is true for different mechanisms generating structural variation. The ‘infinite sites’43 and ‘infinite alleles’44 models that are commonly used for modeling SNP variation may well be appropriate for structural variants generated by NHEJ, but the higher rate of NAHR, and the multiallelic nature of some resultant structural variants, suggests that a model closer in nature to the stepwise mutational models45 used for microsatellites46 would be more appropriate.
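To illustrate what a stepwise mutational model means for a multiallelic CNV, the sketch below follows the copy number of a single lineage, letting each mutation add or remove one copy with equal probability. The function `simulate_stepwise`, the symmetric step and the floor at zero copies are simplifying assumptions for illustration, not a fitted model of NAHR.

```python
import random


def simulate_stepwise(start_copies, mutation_rate, generations, rng=None):
    """Follow one lineage under a symmetric stepwise mutation model.

    Each generation a mutation occurs with probability mutation_rate and
    changes the copy number by +1 or -1 (never below zero).
    """
    rng = rng or random.Random()
    copies = start_copies
    for _ in range(generations):
        if rng.random() < mutation_rate:
            copies = max(0, copies + rng.choice((-1, 1)))
    return copies


# With a rate near the upper NAHR estimate quoted earlier (1e-4 per
# generation), independent lineages can reach the same copy number.
rng = random.Random(1)
print([simulate_stepwise(3, 1e-4, 10_000, rng) for _ in range(5)])
```

The key contrast with the infinite sites and infinite alleles models is that the same allelic state (a given copy number) can be reached repeatedly and reversibly, so identity in state no longer implies identity by descent.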
There is preliminary evidence that mutation rates for certain rearrangements differ markedly between apparently healthy individuals38,47,48. For example, at several loci, it has become apparent that carriers of heterozygous inversions are more susceptible to meiotic rearrangements involving sequences within the inverted interval47. A much greater understanding of the degree to which this is a general phenomenon8 is needed to discern whether mutation rate polymorphism needs to be integrated into population genetic modeling of structural variation.
Explicit population genetic models are required for hypothesis testing and parameter estimation from samples of genetic variation, especially in relation to selection, demography and recombination. At present, most models do not account for structural variation and as a result are likely to give inaccurate estimates and unreliable inferences in some structurally variable regions of the genome (see below). It will be necessary to develop improved models and analyses to cope with the complexity of genetic variation in these genomic locations. Models that consider alleles with distinct structures differently are required.
Our understanding of the mutational mechanisms generating structural variation would be catalyzed by the high-throughput mapping of thousands of structural variation breakpoints at the nucleotide level, which at present is a laborious multistep process even for small numbers of variants.
LD is a term used to describe the nonrandom association between alleles at different loci. LD contains information about demographic history49, recombination50,51 and gene conversion52; it can be used to infer the action of natural selection53,54 and is important for the design and analysis of genome-wide association studies55.
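The nonrandom association described here is commonly quantified with the coefficient D and the squared correlation r². The sketch below computes both for two biallelic loci from haplotype and allele frequencies; the function name `ld_stats` and the example frequencies are illustrative, and a biallelic CNV can be treated like a SNP in this calculation only if it has been genotyped and phased.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


print(ld_stats(0.5, 0.5, 0.5))   # alleles always travel together
print(ld_stats(0.25, 0.5, 0.5))  # alleles are independent
```

An r² close to 1 means that genotyping one locus predicts the other almost perfectly, which is the property that indirect association studies rely on.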
The aim of association mapping is to assay directly or indirectly a large portion of genetic variation in a sample by genotyping a subset of well-characterized, easy-to-assay markers (typically SNPs). Estimation of the extent of LD between SNPs and structural variation is thus crucial and should inform the design of next-generation genome-wide association studies. Existing data suggest that the extent of LD between SNPs and CNVs is lower than LD among SNPs alone11,15. There are several reasons for this. First, the enrichment of CNVs around duplicated sequences places them in the most difficult regions to analyze using high-throughput SNP typing technology56. As LD decays with increasing distance, lower LD between CNVs and SNPs can result from the reduced density of genotypable SNPs in the vicinity of many CNVs. This results in CNVs associated with segmental duplications being less successfully tagged than CNVs in single-copy regions of the genome (Fig. 3). Second, as the mutation rate of some CNVs is higher than that of SNPs, low LD with SNPs could result from recurrent mutation generating allelic diversity. This lower LD has important implications for our prospects of understanding the phenotypic impact of structural variation, as existing indirect association methods will not fare well in the face of allelic diversity57.
Population genetic analyses of genome-wide variation have shown that 80% of allelic recombination is confined to hotspots covering 10%–20% of the genome8. Structural variation is not integrated into the simple models of recombination from which these rate estimates are derived, and it can be expected to decrease the reliability of these estimates in some regions of the genome, especially those harboring common inversions (Fig. 4 and Supplementary Fig. 2 online). Experimental data describing local patterns of recombination around structural variants of all sizes are needed to address this issue and would also improve methods for detecting signals of natural selection acting on structural variants.
The distribution of genetic variation across populations within a species is shaped by population demography and can be measured in different ways. The FST family of statistics58,59 aims to quantify the proportion of variation within and between populations. Studies using diverse marker sets have shown conclusively that humans show little population differentiation relative to other comparable species; typically only 10%–15% of variance occurs between continental groups60,61. This feature of human diversity accords with the archaeological and paleontological evidence for a recent common origin in Africa some 50,000 years ago, which affords little time for extensive differentiation2. A survey of 67 common CNVs amenable to genotyping estimated that only 11% of the variation was attributable to differences between populations11; most of these variants are shared between populations and thus predate the migration out of Africa. Clearly, the distribution of structural variation between populations, like all other forms of genomic variation, is dominated by the recent common ancestry of humans.
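For a single biallelic variant, one simple member of the FST family compares expected heterozygosity within populations with that of the pooled population. The sketch below assumes equally sized populations and known allele frequencies; the function name `fst_biallelic` is an illustrative assumption, and published estimators additionally correct for sample size.

```python
def fst_biallelic(freqs):
    """FST = (H_T - H_S) / H_T for one biallelic locus.

    freqs: allele frequency of the same allele in each population,
    with populations assumed to be of equal size.
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    if h_total == 0:
        return 0.0  # locus is monomorphic overall
    return (h_total - h_within) / h_total


print(fst_biallelic([0.5, 0.5]))  # no differentiation
print(fst_biallelic([0.0, 1.0]))  # fixed difference between populations
print(fst_biallelic([0.4, 0.6]))  # modest differentiation
```

Values of the order of 0.1, as reported for human continental groups, correspond to allele frequencies that differ only moderately between populations.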
The small proportion of variation that can be attributed to differences between populations contains signals of genetic relatedness. A simple method for analyzing these signals is to cluster individuals into an optimal number of populations without regard to their geographical origin62. The four HapMap populations can be clustered into three groups that reflect their continent of origin with high confidence using genotypes at only 67 common autosomal CNVs. The CNV-based clustering is qualitatively similar to that obtained for 67 common autosomal SNPs (Fig. 5) and is sufficient to assign correctly 209/210 individuals to their known continent of origin11.
Selection can distort the population distribution of a given variant such that it is markedly more (or less) differentiated than the average. Thus, identifying unusual patterns of population differentiation should highlight structural variants that have been under recent selective pressures. Studies of individual disease-related loci have identified some notable structural variants with unusually high levels of population differentiation3,63,64, and a recent genome-wide CNV survey replicated many of these findings and identified additional outliers that may have been under recent population-specific selection (see below for specific examples)11. Only a minority of CNVs can be genotyped with high confidence in existing data sets, yet measures of population differentiation such as FST rely on qualitative genotypes. Thus, to quantify the population differentiation of all forms of CNV, it has been necessary to adapt these traditional statistics to the underlying quantitative data11.
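One way such an adaptation to quantitative data might look is a variance-based analog of FST, in which heterozygosities are replaced by variances of the raw intensity measurements. The sketch below is an assumption made for illustration: the function `variance_differentiation` and its weighting scheme are not taken from the text, which does not specify the statistic used.

```python
from statistics import pvariance


def variance_differentiation(populations):
    """(V_T - V_S) / V_T for quantitative copy number measurements.

    populations: list of lists of per-individual measurements.
    V_T is the variance of all measurements pooled; V_S is the
    size-weighted mean of the within-population variances.
    """
    pooled = [x for pop in populations for x in pop]
    v_total = pvariance(pooled)
    if v_total == 0:
        return 0.0  # no variation at all at this locus
    v_within = sum(len(pop) * pvariance(pop) for pop in populations) / len(pooled)
    return (v_total - v_within) / v_total


print(variance_differentiation([[0, 1], [0, 1]]))  # identical distributions
print(variance_differentiation([[0, 0], [1, 1]]))  # complete separation
```

Because it works directly on intensities, a statistic of this kind can be computed for multiallelic and poorly clustered CNVs, at the cost of being sensitive to measurement noise and batch effects.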
Existing CNV maps are not compatible with a model in which structural variation is distributed randomly across the genome. In addition to broad-scale mutational biases toward subtelomeric and pericentromeric regions, there is substantial evidence that CNVs are biased away from functional sequences of all classes11,18,65. The simplest explanation for such observations is that as a class, CNVs are slightly or moderately deleterious and that selection acts against (or ‘purifies’) changes in copy number of functional sequences. Further support for the action of purifying selection may be apparent in the site frequency spectrum of CNVs recorded in recent population surveys, which seems to be skewed toward rare variants66. Ascertainment of CNV by CGH is complicated by incomplete power and a nontrivial false positive rate, which makes formal analysis of the frequency spectrum extremely challenging.
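The site frequency spectrum referred to above is simply a tally of how many variants are observed at each allele count in a sample. The sketch below computes an unfolded spectrum from hypothetical derived-allele counts; the function name and input values are illustrative, and it assumes that every variant is genotyped without error, which the text notes is not yet true for CGH data.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded site frequency spectrum.

    derived_counts: number of sampled chromosomes carrying the derived
    allele, one entry per variant.
    Returns a list whose entry i-1 is the number of variants seen on
    exactly i chromosomes (i = 1 .. n_chromosomes - 1). Variants that
    are absent or fixed in the sample are ignored.
    """
    tally = Counter(c for c in derived_counts if 0 < c < n_chromosomes)
    return [tally.get(i, 0) for i in range(1, n_chromosomes)]


# Hypothetical counts for eight variants in a sample of six chromosomes.
print(site_frequency_spectrum([1, 1, 1, 1, 2, 2, 3, 5], 6))
```

An excess of variants in the lowest count classes relative to neutral expectation is the pattern interpreted as evidence of purifying selection, although incomplete detection power and false positives can distort the same classes.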
Karyotypic analyses of individuals with segmental aneuploidy suggest that the genome is more tolerant of duplication than deletion67. This finding has been confirmed by recent higher-resolution techniques: deletions seem to be biased away from OMIM genes and RefSeq genes compared with duplications. When comparing large-scale (>10 kb) copy number variation ascertained with the same platform, duplications show a much larger median length (120 kb versus 43 kb) and higher average frequency than deletions do11,15. Balanced changes should, in principle, be less deleterious, but such rearrangements may disrupt genes directly (if the breakpoint occurs within a gene) or indirectly (through position effect); informative data on this in humans are extremely scarce.
There are several notable large-scale differences in gene copy number between humans and chimpanzees, some of which may have become fixed as humans adapted to their changing environments68,69. There are many circumstantial claims of natural selection acting on existing structural variants, the majority of which involve genes mediating innate or acquired immunity. A number of deletions (including those in genes such as globins and SLC4A1) are found only in areas of the world where malaria is endemic. Extreme population differentiation has been noted for the CCL3L1 polymorphism11,63, which influences human susceptibility to HIV infection. Characterization of a recently discovered 1-Mb inversion at 17q21 has demonstrated unusual patterns of divergence between the two inversion alleles64 and has been adduced as a signal of natural selection acting on the derived inversion haplotype.
Most studies of CNVs have detected an enrichment of genes involved in sensory perception, immune response and cell adhesion within polymorphic sequences65. This observation has been used to argue for the action of positive selection65. We must think carefully about invoking such forces. On one hand, LD-based signatures of recent positive selection are enriched within certain gene ontology classes70. However, genes overrepresented in CNV are also enriched within segmental duplications71, which themselves show elevated structural dynamism. An enrichment of these classes within structural variants may reflect, in part, mutational biases and perhaps the genomic ‘fossils’ of past selective events that acted on gene copy number.
Balancing selection might also explain why some classes of genes are enriched for structural variation, but early genome-wide surveys have suggested that ancient balancing selection is rare within human populations72,73. In the long term, it seems that gene duplication is often a more stable evolutionary strategy than balancing selection for accommodating similar but differentiated gene functions, but the possibility that recent balancing selection is more common merits further investigation.
Each structural variant requires detailed characterization to fully resolve its evolutionary history. Having robust genotyping assays for specific rearrangements would facilitate this characterization. Qualitative assays that are targeted to the breakpoints of structural variants have significant advantages over quantitative assays27, including the possibility that they can be applied to balanced rearrangements such as inversions74. With such genotyping assays in hand, it should be possible to estimate the age of structural variants (as has been possible for other allelic variants75) and integrate them into their surrounding haplotypes, which will provide much informative data on patterns of selection54. It is worth noting that existing haplotype-based tests9,54,70 for selection often assume that a variant does not perturb neighboring sites of variation, so these methods often need to be adapted to take into account the size of a structural variant11.
We are currently observing the birth of a new subdiscipline in the population genetics of structural variation. Although the same questions can be asked about all types of variation, the theoretical and experimental tools required for investigating structural variation will inevitably require some adaptation. We emphasize that the mutational complexity of structural variation precludes a one-size-fits-all approach to modeling structural variation.
Population genetics has the power to provide insights into the demographic history of populations, selective pressures acting on genetic variation and mutational processes generating diversity. We see the future of the population genetics of structural variation as making substantial contributions to the latter two areas. By virtue of their number and simplicity, SNPs and microsatellites will remain the markers of choice for illuminating population demographic histories.
Clearly, our current knowledge of the locations, frequencies and types of structural variation in the human genome is rudimentary, but we anticipate rapid growth in the discovery of novel variants, especially smaller ones, over the next few years. Integrating structural variation detection within new sequencing technologies will be critical to take advantage of the coming era of human genome–wide resequencing.
We end by highlighting two important future challenges: (i) identifying structural variants that have facilitated recent human adaptation to novel environmental pressures and (ii) using our understanding of the population genetics of structural variation to identify structural variants influencing disease risk. These two challenges epitomize the benefits to evolutionary and medical genetics of an improved understanding of the population genetics of structural variation.
The authors are grateful to G. Coop and C. Tyler-Smith for their comments on an earlier manuscript and to D. Andrews for data processing.
Note: Supplementary information is available on the Nature Genetics website.
COMPETING INTERESTS STATEMENT The authors declare no competing financial interests.
Donald F Conrad, Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA.
Matthew E Hurles, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK.