|Home | About | Journals | Submit | Contact Us | Français|
Structurally complex genomic regions are not yet understood. One such locus, human 17q21.31, contains a megabase-long inversion polymorphism1, many uncharacterized copynumber variations (CNVs), and markers that associate with female fertility1, female meiotic recombination1–3, and neurological disease4,5. Additionally, the inverted H2 form of 17q21.31 appears to be positively selected in Europeans1. We developed a population-genetic approach to reveal complex genome structures and identified nine segregating structural forms of 17q21.31. Both the H1 and H2 forms of the 17q21.31 inversion polymorphism contain independently derived, partial duplications of the KANSL1 (KIAA1267) gene; these duplications, which produce novel KANSL1 transcripts, have both recently risen to high allele frequencies (26% and 19%) in Europeans. An older H2 form, lacking such a duplication, is present at low frequency in Europeans and Central African hunter-gatherer populations. We show that complex genome structures can be analyzed by imputation from SNPs.
Simple, common deletion and duplication polymorphisms have been typed in large cohorts6,7, found to segregate on SNP haplotypes6,7, and associated with many human phenotypes via proxy SNPs7–10. Complex structural mutations have also been described in specific patients and cancers11. By contrast, little is known about genomic loci that show population-level complexity –loci at which germline structural mutations in many different ancestors have given rise to complex patterns of variation. Such loci have not been analyzed in HapMap12,13 or the 1000 Genomes Project14,15, as they are assumed to require reconstruction of each structural form from genomic clones or fluorescence in situ hybridization (FISH).
We hypothesized that extensive structural information is present in the statistical relationships among genome structural features in populations. A population-genetic approach to reconstructing structural forms of a complex locus would require two capabilities. First, individual structural features would need to be accurately typed in population cohorts. This is increasingly possible, due to widely available genome sequence data14 and approaches for typing structural polymorphism in such data10. Second, it would be necessary to infer the haplotypes formed by multiple structural features; such relationships might be made visible in haplotype sharing by relatives or statistical phasing in populations. We sought to understand the extent to which such an approach could reveal genome structures at 17q21.31.
We first examined what structural features of 17q21.31 vary in populations. In addition to the known inversion at 17q21.311, we used array and whole-genome sequence (WGS) data to identify several segments of 17q21.31 that exhibit distinct patterns of population-level variation in copy number (Fig. 1a, Methods). We located most of the boundaries of these segments at high resolution by finding read-depth transitions and then (where possible) breakpoint-spanning reads in data from the 1000 Genomes Project pilot14 (Methods, Supplementary Fig. 1 Supplementary Table 1). This analysis defined three common, overlapping duplication polymorphisms: (i) duplication α, a 150-kb duplication (covering the 5’ end of the KANSL1 gene), previously partially characterized on the H2 haplotype of 17q21.3116; (ii) duplication β, a novel, longer (300 kb), duplication, overlapping duplication α, but segregating separately from it; and (iii) duplication γ, a highly multi-allelic 218-kb duplication at the distal end of the 17q21.31 inversion, covering much of the NSF gene.
We next sought to type each of these structural features in populations. To address longstanding challenges in measuring the integer copy number of multi-allelic duplication CNVs17, we deployed two new methods: (i) analysis of read depth by applying the Genome STRiP algorithm10 to whole-genome sequence (WGS) data from 946 unrelated individuals sampled in the 1000 Genomes Project14 (Fig. 1b–d, Methods); and (ii) a droplet-based approach to digital PCR (ddPCR)18, to analyze 120 father-mother-offspring trios from HapMap (Fig. 1e–g, Methods). These measurements of integer copy number, which varied from 2 to 8, were 99.1% concordant across 234 genotypes in overlapping samples, validating both methods (Fig. 1h–j). The integer copy numbers of these segments were then de-convolved into the contributions of the three overlapping duplication polymorphisms (Fig. 1a, Supplementary Fig. 2–3, Supplementary Tables 2–9). The state of the inversion polymorphism was inferred from more than 100 tagging SNPs.
We then sought to understand these complex patterns of variation in terms of structural haplotypes – which structural features segregate separately or together. By inferring the number of copies of each CNV segment that segregated on transmitted and untransmitted haplotypes in each trio, we determined the chromosomal phase of each of these segmental copy numbers – with respect to one another, with respect to the inversion polymorphism, and with respect to SNPs across the locus (Fig. 1k,l, Fig. 2, Supplementary Table 10, Methods). All four structural features (three duplications and the inversion) were highly polymorphic, but they segregated as only nine common haplotypes (Fig. 2). Applying a maximum-likelihood model to unphased copy number measurements we derived from 1000 Genomes sequence data (Fig. 1b–d), we inferred frequencies of these haplotypes in twelve populations (Supplementary Table 11).
The above analyses established the copy-number content of segregating haplotypes but did not establish the genomic locations of structural features. We inferred their locations from both sequence data (breakpoint-spanning reads) and linkage disequilibrium to SNPs. Both forms of evidence indicated that duplications β and γ are tandem duplications (Supplementary Table 1). In contrast, duplication α appears dispersed to a site 600 kb away from the original copy (Fig. 2, haplotype H2.α2), an observation that was reported earlier16 and that we confirmed by reconstructing a clone spanning an earlier gap in the H2 sequence (Supplementary Note).
Although the H2 inversion form of 17q21.31 is reported to harbor little diversity1,9,10, we found that human populations also possess an older, structurally distinct H2 haplotype at low frequency. Multiple lines of evidence indicate that this rarer H2 structural form (H2.α1) is the ancestral H2 structure. First, H2.α1 resembles the H1 structure more closely than does the previously described H2 structure1,16 (H2.α2) (Fig. 2, Supplementary Fig. 1, Supplementary Note). Second, we identified H2.α1 haplotypes in Central African hunter-gatherer populations (including two Mbuti Pygmies and one Biaka Pygmy, among 13 and 21 individuals sampled from those populations by the Human Genome Diversity Panel16); such populations could have harbored H2.α1 over the long period (estimated at 2–3 million years1,16) during which H2 diverged from H1. The H2.α1 structure provides a potential missing link that would explain how the inversion could have occurred by a simple non-allelic homologous recombination (Supplementary Fig. 4, Supplementary Note).
The H2 inversion state is common in West Eurasians and rare in most other populations, which has been attributed to recent positive selection1. We found that other structural variations at 17q21.31 exhibit even greater population differentiation. Two distinct duplications (α and β in Fig. 1a and Fig. 2), each affecting the 5’ coding exons of the KANSL1 gene, have arisen independently on the H1 and H2 backgrounds. Both duplications have reached high allele frequency (19% and 26%) in West Eurasian populations, together comprising almost half of all European chromosomes; but we observed the α and β duplications only 1 and 0 times (respectively) among 502 East Asian chromosomes and 316 African chromosomes analyzed in Phase 1 of the 1000 Genomes Project (Fig. 2, Supplementary Tables 4–9, 11), placing both among the human genome’s most population-differentiated polymorphisms (Methods and Supplementary Fig. 5). The α and β duplications have reached these highly differentiated allele frequencies in parallel at the same locus and in the same populations, a pattern similar to that observed at other loci (such as the LCT and APOL1 loci in African populations21,22) that have undergone recent selection.
We estimated two dates for each duplication - a coalescent age of haplotypes sampled today, and the age of the duplication events. The first can be estimated from the divergence of sequences flanking the duplications, the second by comparing the sequences of the duplication copies. To generate these data, we selectively captured and sequenced the 17q21.31 region in H1.β2 and H2.α2 homozygotes. We estimate the coalescence of the sampled beta-duplicated H1 chromosomes at 12 thousand years ago (kya). Divergence of otherwise unique sequences within the beta duplication suggests that the duplication itself occurred 20–27 kya. For the alpha-duplicated H2 chromosomes, we estimate an average coalescence of 17 kya, but the duplication itself appears much older (>1 Mya) than its rise to high frequency in West Eurasia. (See Supplementary Note S8 and Supplementary Table 12 for details of the dating and discussion of uncertainty surrounding the dates.)
The parallel increases in frequency of duplication α (on H2) and duplication β (on H1) in the same populations invite the hypothesis that they could influence a common phenotype. Both duplications involve the 5’ exons of KANSL1 (also called KIAA1267, MSL1v1, and CENP-36.). We found that both α and β give rise to novel KANSL1 transcripts (which we confirmed by rtPCR and sequencing) in which the 5’ exons of KANSL1 fuse to cryptic exons that terminate its coding sequence. The encoded proteins would retain the coiled coil domain of KANSL1 but would lack its PEHE domain (Supplementary Fig. 6). Interestingly, a similar truncation in the Drosophila ortholog of KANSL1, GC4699/E(nos), was identified in a mutagenesis screen for modifiers of Nanos and was found to enhance the effect of a Nanos hypomorph on age-dependent female fertility and germ line stem cell differentiation26. The precise role of KANSL1 in these processes is unknown, though it is found within a chromatin-modifying complex24,25.
It is important to understand how genome structures relate to variation in phenotypes. Complex genome structures are today not typed in sequencing or array-based genome wide association studies. Structurally complex regions are assumed to be poorly captured by LD to SNPs27,28. However, our analysis suggested that the structural diversity at 17q21.31 arose from a definable series of structural mutations (Fig. 2); each mutation likely arose on a specific haplotype and may continue to segregate on that haplotype. Such haplotypes might be made visible by combinations of many SNPs.
We therefore analyzed the SNP haplotypes on which each 17q21.31 structural form segregates in European populations (Fig. 3). The structural forms of 17q21.31 bore strong relationships to SNP haplotypes on both sides of the distal end of the 17q21.31 inversion, where the polymorphic CNV copies reside (Fig. 3). These results suggested it might be possible to capture 17q21.31 structural diversity through statistical imputation from SNPs30–32.
We therefore constructed and evaluated the first imputation resource for a structurally complex locus. We created reference haplotypes from 94 CEU trio founders and a cosmopolitan cohort of 373 unrelated individuals from the 1000 Genomes Project phase 1 data, phasing structural variation along with 934 reference SNP haplotypes and removing any SNPs that fell within CNVs (Methods).
We evaluated imputation efficacy for each structural feature using leave-one-out tests. In each test, we selected a different test individual from the reference cohort and removed his structural-variation data from the reference genotypes. The test individual was always a CEU trio founder, for whom we had separately determined CNV states by ddPCR (Fig. 1e–g) and trio-based phasing (Fig. 1k,l). In each simulation, we used the rest of the reference genotypes as an unphased panel, together with the backbone SNP data from the test individual, to phase all the data and then impute (using Beagle33) the states of the structural alleles into the test individual. We then compared this prediction to the independently derived experimental data.
As a metric of imputation efficacy, we evaluated the statistical correlation (r2) of the experimentally determined structural state with the imputation-based, probabilistic dosage of each structural feature (Table 1, Supplementary Table 13). This metric estimates the efficacy of imputation; 1/r2 gives the proportional increase in sample size that would be required (in additive tests of association) to recover the statistical power obtained by explicitly typing each variant. For the four large structural features analyzed (the α, β, γ duplications and the inversion), imputation from low-coverage genome sequence data yielded structural determinations that correlated strongly (r2 = 0.99, 0.93, 0.84, 1.00) with the true diploid copy number (Table 1). This efficacy was only modestly lower using earlier panels of SNPs typed in GWAS (Table 1). Imputation was able to capture the multi-allelic CNVs substantially better than individual SNPs were (Table 1). These results suggest that imputing reference haplotypes into available SNP data will allow structural forms of 17q21.31, and perhaps many other such loci, to be evaluated for relationships to human phenotypes.
We have described a population-genetic approach for characterizing structurally complex and diverse genome variations. Our approach is complementary to existing methods based on FISH and clone reconstruction. Drawing upon population-level sequence data sets, this approach will yield models of how structurally multi-allelic loci vary in populations. Our results motivate the creation of integrated SNP/structure haplotype maps that will allow complex genome structures to be imputed into many other genomes using available SNP data. Our results and methods offer new ways of analyzing complex genome structures and relating them to human disease.
We used a combination of array and sequence data to find the breakpoints of each CNV in the 17q21.31 region. Using array-based data, we identified the approximate span of CNV segments at kilobase resolution. We then refined the boundaries of these segments to 100 bp resolution by comparing read-depth profiles. Ultimately, the precise breakpoints of these rearrangements were identified by searching the 1000 Genomes data. Details of this analysis are provided in the Supplementary Note.
To determine integer copy number of CNV segments (regions 1–3), we used a droplet-based digital PCR method18. We designed a pair of PCR primers and a dual-labeled fluorescence/FRET oligonucleotide probe to both the CNV locus and a two-copy control locus. Genomic DNA, and primers and probes for both assays were compartmentalized into droplets in an oil/aqueous emulsion (QuantaLife). We performed PCR amplification on the emulsion and then counted the number of droplets that were positive and negative for each fluorophore with a droplet reader (QuantaLife). Absolute copy number for the CNV locus was determined by comparing droplet counts of the CNV locus to the two-copy control locus. This method is described in greater detail in the Supplementary Note.
WGS read depth was also used to determine copy number in regions 1,2 and 3. We adapted the Genome STRiP genotyping method10 to analyze duplications in low coverage sequencing data from 1000 Genomes Phase 1. Details of this analysis are available in the Supplementary Note.
The ancient, megabase-long inversion polymorphism at 17q21.311,16 has resulted in a large number of fixed differences between the two inversion states, because opposite alleles of the inversion cannot viably recombine with each other within the inverted region. These two inversion states therefore define two long haplotypes, H1 and H2, with hundreds of fixed differences between them. We refined these haplotypes using low-coverage data from Phase I of the 1000 Genomes Project, finding 1886 sites that are in perfect LD (=1) even in the large ascertainment (1,000 individuals) afforded by these data. Although the megabase inversion polymorphism has been reported to “toggle” on the longer timescales of primate evolution16, no study to date has reported any discordance in humans between the cytogenetic orientation of this megabase-long segment and the state of these inversion-proxy SNPs. We therefore used these long SNP haplotypes to diagnose inversion state (Supplementary Tables 2–3). We observed two individuals in 1000 Genomes for whom SNP genotypes in specific segments within the inverted region suggested an H1/H2 type that was discordant from that suggested by the rest of their SNPs at the locus. The clustering of these discordant SNPs suggested that these individuals’ genomes reflect gene-conversion or double-recombination events that occurred within the inverted regions; these individuals were not included in subsequent analyses.
Inferring haplotypic contributions to diploid copy number (Fig. 1j) was addressed with a joint maximum-likelihood analysis of genotypes, allele frequencies, and inheritance patterns in trios. Each population was analyzed separately. We considered all possible combinations of integer copy number (on each of the four haplotypes in a trio – paternal transmitted, paternal untransmitted, maternal transmitted, maternal untransmitted) that were consistent with the diploid copy-number measurements from all three trio members from ddPCR. See the Supplementary note for details of this analysis and a description of the expectation-maximization algorithm.
For regions 1, 2, and 3, diploid copy number was first measured using read depth from whole genome sequencing by the Genome STRiP algorithm10 in low coverage WGS data from populations of unrelated individuals from 1000 Genomes. Allele frequencies were determined from genotype frequencies with an expectation-maximization (EM) algorithm similar to that described in S4 of the Supplementary Note, but without the additional constraints provided by inheritance in trios (Supplementary Tables 2–4).
We determined phased structural genotypes for 47 CEU trios based on our model of the nine structural haplotypes, the ddPCR diploid copy number estimates for regions 1, 2 and 3, the assayed H1/H2 inversion state in each sample, and by assuming Mendelian inheritance in the trios. Phased haplotypes for the founders in these trios are listed in Supplementary Table 10. In a similar manner, we determined phased structural genotypes for 373 additional unrelated samples (where genotype could be determined without the benefit of trio inheritance constraints) from 1000 Genomes, using diploid copy number estimates from read depth genotyping in the 1000 Genomes Phase 1 low coverage sequence data and determining H1/H2 inversion state based on a tag SNP rs17660065. We used this set of resolved structural haplotypes for 467 individuals (the 373 individuals from 1000 Genomes plus the 94 trio founders) to evaluate imputation of the structural haplotypes from the genotypes of nearby SNPs using Beagle29. We evaluated imputation accuracy through a series of “leave one out” trials in which we withheld information on one individual from the reference panel and then imputed the structural haplotypes for that individual based on their genotypes at surrounding SNPs. Details of allelic encoding, evaluation methodology, and estimation of CNV dosages are available in the Supplementary Note.
We designed primers to amplify the KANSL1 fusion gene transcript created by the duplication using information from genomic breakpoints and publicly available RNA sequence data23. The primer design to amplify the KANSL1 fusion gene transcript created by the duplication was informed by genomic breakpoints and mRNA clone BC006271, which is likely a complete fusion transcript resulting from the duplication. These analyses are discussed in detail in the Supplementary Note.
We performed targeted capture and sequencing of the 17q21.31 region for 3 individuals homozygous for duplication and 4 individuals homozygous for duplication. Excluding haplotypes that showed evidence of recombination with other structural forms, we computed the average pairwise diversity among chromosomes with the duplication and separately among chromosomes with the duplication. To estimate a coalescence date, the amount of diversity surrounding each duplication was calibrated with human-chimp divergence over the same region. Human-chimpanzee speciation was assumed to be 6 Mya. Details of the coalescence analysis and direct duplication dating are available in the Supplementary Note.
To evaluate the stratification of each duplication, we calculated the fraction of SNPs (from 1000 Genomes phase 1) having similarly high derived allele frequency in the European populations sampled (379 samples from the CEU, FIN, GBR, IBS and TSI population groups) that had similarly low derived allele frequency across the non-European populations sampled (471 samples from the CHB, CHS, JPT, LWK and YRI population groups). Details of this analysis as well as analysis of allele frequency differentiation within Europe are provided in the Supplementary Note.
To compare the efficacy of imputation to using a single best tag SNP, we computed the correlation (r2) between the diploid copy number of each CNV (alpha, beta, gamma and the H1/H2 state encoded as 0,1,2) and the dosage of each SNP in each reference panel, encoded as 0,1,2. For the Illumina 1M and Affymetrix 6.0 comparisons, we considered only the subset of SNPs from the HapMap3 panel present on each array. We report the highest r2 across all SNPs in the panel.
This work was supported by a Smith Family Award for Excellence in Biomedical Research to SM, by the National Human Genome Research Institute (U01HG005208-01S1), and by startup resources from the Harvard Medical School Department of Genetics. Joshua Korn provided an early version of software for visualizing haplotype diversity. Nadin Rohland and Tom Mullen contributed expertise on laboratory experiments. We thank Nick Patterson, David Reich, David Altshuler, Eric Lander, Brian Browning, Joshua Korn, Jesse Gray, Chris Patil, Giulio Genovese, Aswin Sekar, and Sharon Grossman for helpful conversations and/or comments on the manuscript.
Sequence data is available at the Sequence Read Archive under accession number SRA052055.1.
AUTHOR CONTRIBUTIONSSM, LB, and RH conceived the strategy for population-genetic dissection of structurally complex loci. LB performed all laboratory experiments and multiple computational analyses including the estimation of haplotype frequencies, delineation of CNV regions and alignment of next generation sequence data. RH performed computational analyses of the 1000 Genomes data, including finding breakpoint spanning reads for CNVs, and integrated analyses of SNP/CNV haplotypes. MZ performed analyses of sequence data to determine large-scale structures, to estimate coalescence and mutation dates, and to reconstruct the evolutionary history of the locus. RH and LB developed the imputation strategy. SM, LB, RH, and MZ wrote the manuscript.