Recombination, together with mutation, is the ultimate source of genetic variation in populations. We leverage the recent mixture of people of African and European ancestry in the Americas to build a genetic map measuring the probability of crossing-over at each position in the genome, based on about 2.1 million crossovers in 30,000 unrelated African Americans. At intervals of more than three megabases it is nearly identical to a map built in Europeans. At finer scales it differs significantly, and we identify about 2,500 recombination hotspots that are active in people of West African ancestry but nearly inactive in Europeans. The probability of a crossover at these hotspots is almost fully controlled by the alleles an individual carries at PRDM9 (P<10^−245). We identify a 17 base pair DNA sequence motif that is enriched in these hotspots, and is an excellent match to the predicted binding target of African-enriched alleles of PRDM9.
In humans and many other species, recombination is not evenly distributed across the genome, but instead occurs in “hotspots”: two kilobase (kb) segments where the crossover rate is far higher than in the flanking DNA sequence1,2,3. The highest resolution genetic map in contemporary humans to date, the “deCODE Map”, is based on about 500,000 crossovers identified in 15,000 Icelandic meioses4. However, a limitation of maps built in people of European descent4,5,6 is that they may not apply equally well in other populations, as suggested by comparisons of maps across ethnic groups4,7,8,9 and patterns of linkage disequilibrium (LD) breakdown which suggest that more of the genome may be recombinationally active in West Africans10. It is known that a major determinant of the positions of recombination hotspots is PRDM9, a meiosis-specific histone H3 methyltransferase whose zinc finger (ZF) domain binds DNA sequence motifs11,12,13. In Europeans, PRDM9 ZF arrays are predominantly of two similar types, “A” and “B”, both of which bind the 13-bp motif CCNCCNTNNCCNC11. In contrast, 36% of West African alleles are not of the A or B type9,13. Sperm typing of males who carry neither the A nor the B allele has shown no evidence of crossover activity at recombination hotspots associated with the 13-bp motif9.
To investigate differences in the crossover landscape across human populations, we built a genetic map in African Americans, who have an average of about 80% West African and 20% European ancestry, leading to genomes composed of multi-megabase stretches of either West African or European ancestry14. Computational approaches, including HAPMIX15, have been developed to infer the probability of 0, 1 or 2 European or African alleles at each locus in individuals genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs)15,16,17. Positions where the inferred number of European or African alleles changes reflect crossover events that have occurred since admixture began (on average six generations ago15). The change in the probability of European ancestry between adjacent SNPs can be interpreted as the probability of such a crossover between them. We inferred crossover events in 29,589 apparently unrelated African Americans who had been genotyped on SNP arrays in genetic association studies (Methods; Figure 1A). To minimize false-positive crossovers, we restricted to crossovers that HAPMIX inferred with probability of >95%, and that were flanked by a minimum of 2 centimorgan (cM) stretches where the ancestry was inferred to be unchanging (Note S1). This produced 2,113,293 high-confidence crossovers, with a typical switch point resolved within 70 kb with probability 50% (Note S1).
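The filtering logic described above can be illustrated with a minimal Python sketch. It is not the HAPMIX implementation: the input is a hypothetical vector of per-SNP probabilities of European ancestry for a single haplotype, the function names are invented for illustration, and the choice to call a SNP confidently when its probability is at most 0.025 or at least 0.975 is an assumption made so that a switch between two confident runs corresponds to a change in ancestry probability of at least 0.95.

```python
from typing import List, Tuple


def interval_crossover_probs(p_eur: List[float]) -> List[float]:
    """Change in European-ancestry probability between adjacent SNPs,
    read as the probability of a crossover in each inter-SNP interval."""
    return [abs(b - a) for a, b in zip(p_eur, p_eur[1:])]


def confident_switches(
    p_eur: List[float],
    cm: List[float],
    min_prob: float = 0.95,
    min_flank_cm: float = 2.0,
) -> List[Tuple[int, int]]:
    """Return (last SNP before, first SNP after) index pairs for ancestry
    switches flanked on both sides by >= min_flank_cm of stable ancestry.

    Assumption (illustrative): a SNP is confidently called when its
    probability lies within (1 - min_prob) / 2 of 0 or 1.
    """
    lo = (1.0 - min_prob) / 2.0
    hi = 1.0 - lo
    runs = []  # each run: (state, first SNP index, last SNP index)
    for i, p in enumerate(p_eur):
        state = 0 if p <= lo else 1 if p >= hi else None
        if state is None:
            continue  # ambiguous SNP: belongs to the switch region
        if runs and runs[-1][0] == state:
            runs[-1] = (state, runs[-1][1], i)
        else:
            runs.append((state, i, i))
    switches = []
    for (_, a1, b1), (_, a2, b2) in zip(runs, runs[1:]):
        if cm[b1] - cm[a1] >= min_flank_cm and cm[b2] - cm[a2] >= min_flank_cm:
            switches.append((b1, a2))
    return switches


# Toy haplotype: African ancestry for 2.5 cM, then European for 2.2 cM.
p = [0.01, 0.01, 0.02, 0.30, 0.70, 0.99, 0.99, 0.99]
cm_pos = [0.0, 1.0, 2.5, 2.6, 2.7, 2.8, 4.0, 5.0]
print(confident_switches(p, cm_pos))
print(interval_crossover_probs(p))
```

In this toy example the crossover is localized to the region between the third and sixth SNPs, and the per-interval probabilities inside that region describe where within it the switch most likely occurred.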
To build a high resolution African American genetic map (AA Map), we leveraged the fact that most crossovers occur in hotspots shared across individuals1 (Methods). Intuitively, while any crossover can only be roughly localized, inter-SNP intervals that are inferred to have an appreciable probability of crossover in multiple individuals are likely to contain recombination hotspots, allowing much better localization (Figure S1). To implement this idea, we modeled the recombination rate for each inter-SNP interval as shared across individuals, and used a Markov Chain Monte Carlo (MCMC) to sample rates consistent with the data (Methods). This provides well-calibrated estimates of the crossing-over rate between all pairs of markers as well as estimates of rate uncertainty (Note S1 and Figure S2). We find that the interval size at which the average recombination rate is equal to the standard error is 6 kb, which is the same accuracy that would be expected from a map based on 500,000 crossovers whose boundaries were precisely resolved (Note S1). Despite this high resolution, there are also some limitations. First, the AA Map does not separately infer male and female recombination rates (it is a sex-averaged map) and requires normalization by the total map length (like LD maps3,18). Second, the map has less resolution and may miss a higher fraction of true crossovers at loci where it is more difficult to detect and resolve crossovers due to low SNP density or low differentiation between West Africans and Europeans. Third, the map may be biased where ancestry deviates from the average, for example at 8q24, where the 10% of the people in this study who have prostate cancer have an elevated proportion of African ancestry19. Fourth, the map assumes that all individuals are unrelated, whereas in fact there is likely some shared ancestry, resulting in multiple counting of some crossovers and an overestimate of map precision.
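The intuition that crossovers shared across individuals sharpen localization can be sketched with a small Gibbs sampler. This is an illustrative stand-in, not the MCMC described in the Methods: it assumes each crossover comes with a set of candidate intervals and localization weights (a hypothetical input format), places a symmetric Dirichlet prior on the normalized interval rates (the value of `alpha` is arbitrary), and alternates between assigning each crossover to one interval and resampling the shared rates.

```python
import random
from typing import Dict, List


def sample_shared_rates(
    windows: List[Dict[int, float]],
    n_intervals: int,
    n_iter: int = 500,
    burn_in: int = 100,
    alpha: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Posterior-mean normalized rate per interval.

    windows[c] maps candidate interval index -> localization weight for
    crossover c. Rates are shared by all crossovers (illustrative model).
    """
    rng = random.Random(seed)
    rates = [1.0 / n_intervals] * n_intervals
    mean = [0.0] * n_intervals
    kept = n_iter - burn_in
    for it in range(n_iter):
        # Step 1: assign each crossover to one interval, with probability
        # proportional to (shared rate) x (its own localization weight).
        counts = [0] * n_intervals
        for w in windows:
            ks = list(w)
            weights = [rates[k] * w[k] for k in ks]
            counts[rng.choices(ks, weights=weights)[0]] += 1
        # Step 2: resample rates from Dirichlet(alpha + counts), drawn as
        # normalized gamma variates.
        draws = [rng.gammavariate(alpha + c, 1.0) for c in counts]
        total = sum(draws)
        rates = [d / total for d in draws]
        if it >= burn_in:
            for k in range(n_intervals):
                mean[k] += rates[k] / kept
    return mean


# Toy data: 30 crossovers, each only roughly localized, all overlapping
# interval 4; interval 9 is never a candidate.
shapes = [range(3, 6), range(4, 8), range(2, 5), range(4, 5), range(0, 5)]
toy_windows = [{k: 1.0 for k in s} for s in shapes for _ in range(6)]
rate_mean = sample_shared_rates(toy_windows, n_intervals=10)
print([round(r, 3) for r in rate_mean])
```

Because every crossover window in the toy data overlaps the same interval, the sampled rates concentrate there even though most individual crossovers are spread over several intervals. Averaging the retained samples gives a point estimate, and their spread would give a measure of rate uncertainty.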
To assess the accuracy of the AA Map, we generated an independent African American pedigree map by analyzing 222 nuclear families that included 1,056 meioses in which we could directly detect crossovers between parent and child (Methods; Figure 1A). Examination of the AA Map rate around directly detected crossovers confirms the high resolution: the rate around such crossovers shows at least as strong a peak as that observed in maps based on LD2,3,18 (Figure S3). We next computed correlation coefficients for both the AA Map and the deCODE Map4 to maps derived from the breakdown of LD in Europeans (CEU) and West Africans (YRI)18. At broad scales (>3 Mb) they are almost identical (ρ>0.97; Table 1). At fine scales, the AA Map is more accurate (Table 1 and Table S1), as reflected in a modest improvement in correlation to the CEU Map at a 3 kb scale (ρAA,CEU(3kb) = 0.66 vs. ρdeCODE,CEU(3kb) = 0.58), and a major improvement for the YRI Map (ρAA,YRI(3kb) = 0.71 vs. ρdeCODE,YRI(3kb) = 0.53). The deCODE Map is more correlated to the CEU Map than to the YRI Map at scales <1 Mb, suggesting that this map, built in Icelanders, reflects more European recombination rates. The AA Map shows the opposite pattern, suggesting that it reflects more West African recombination patterns.
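One way to compare two genetic maps at a chosen scale is to cut the genome into fixed-size bins, compute the genetic length of each bin in both maps, and correlate the two series. The Python sketch below does this under stated assumptions: each map is a hypothetical pair of lists (sorted physical positions in bp and cumulative genetic position in cM), cumulative cM is interpolated linearly at bin edges, and a Pearson correlation is used. The exact binning and correlation measure used for Table 1 are not specified in this text.

```python
import math
from bisect import bisect_right
from typing import List


def cm_at(pos: int, positions: List[int], cms: List[float]) -> float:
    """Cumulative cM at a physical position, by linear interpolation."""
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect_right(positions, pos)
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def binned_lengths(positions, cms, start, end, bin_size) -> List[float]:
    """Genetic length (cM) of each bin of bin_size bp in [start, end]."""
    edges = [cm_at(e, positions, cms) for e in range(start, end + 1, bin_size)]
    return [b - a for a, b in zip(edges, edges[1:])]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy maps with the same hotspot (3-6 kb) but different total lengths.
pos = [0, 3000, 6000, 9000, 12000]
map_a = [0.0, 0.001, 0.010, 0.011, 0.012]
map_b = [2 * c for c in map_a]
rho = pearson(
    binned_lengths(pos, map_a, 0, 12000, 3000),
    binned_lengths(pos, map_b, 0, 12000, 3000),
)
print(round(rho, 6))
```

Because the correlation is insensitive to overall scaling, two maps with the same hotspot placement but different total lengths still correlate perfectly in this toy case. Increasing `bin_size` averages over fine-scale differences, which is why broad-scale correlations can be high even when fine-scale ones are not.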
We compared the rate estimates for all four maps (AA, deCODE, CEU and YRI) over a 200 kb region within the MHC locus where recombination rates in European males have been characterized through sperm typing1 (Figure 1B). The AA Map detects five of six known hotspots, and localizes them to within 1 kb (the sixth hotspot is weak, with a peak male rate below the genome average1). Strikingly, the two maps based on samples with African ancestry (AA and YRI) found a hotspot not present in either map based on samples of European ancestry (deCODE and CEU) (Figure 1C; Figure S4 gives a second example). We confirmed that such “African-enriched” hotspots also occur genome-wide, by examining 2,375 loci with recombination rate peaks in the YRI Map (>5 cM/Mb) but not the CEU Map (<1 cM/Mb), and finding a rate rise in the independently generated AA Map, but not in the deCODE Map (Figure S5A). In the reciprocal experiment searching for European-specific hotspots, we find no such evidence for genuine ancestry specificity; at loci with recombination rate peaks in the CEU Map but not the YRI Map, there are weak peaks in both the deCODE and AA maps (Methods; Figure S5B). Thus, hotspots active in Europeans are consistently “shared” with YRI and African Americans, while populations with African ancestry harbor additional, non-shared hotspots we call “African-enriched”.
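The genome-wide selection of candidate African-enriched loci amounts to a threshold comparison between two maps. A minimal Python sketch, assuming hypothetical per-locus rate lists in cM/Mb for the YRI and CEU maps (the function name and input layout are illustrative, and the real analysis involves peak calling that is not shown here):

```python
from typing import List


def african_enriched_candidates(
    yri_rate: List[float],
    ceu_rate: List[float],
    peak_min: float = 5.0,   # cM/Mb, threshold quoted in the text for YRI
    other_max: float = 1.0,  # cM/Mb, threshold quoted in the text for CEU
) -> List[int]:
    """Indices of loci with a YRI peak (> peak_min) and no CEU peak
    (< other_max)."""
    return [
        i
        for i, (y, c) in enumerate(zip(yri_rate, ceu_rate))
        if y > peak_min and c < other_max
    ]


yri = [6.2, 8.0, 0.4, 5.5, 12.0]
ceu = [0.3, 7.5, 0.2, 1.4, 0.9]
print(african_enriched_candidates(yri, ceu))
```

Swapping the two arguments gives the reciprocal search for loci with a CEU peak but no YRI peak.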
To understand the features of recombination in West Africans that differ from Europeans, we estimated the degree to which each African American person’s crossovers occur in “African-enriched” hotspots, compared with “Shared” hotspots, a phenotype we refer to as their “African-enrichment” (AE). We view each individual’s crossovers as sampled from a mixture of two genetic maps (an “S Map” of shared hotspots based on the deCODE Map, and an “AE Map” of African-enriched hotspots that is learned from comparing the deCODE and AA Maps), so that the proportion of crossovers assigned to the AE Map is a person’s AE phenotype (Note S4). We tested approximately 3 million SNPs (genotyped and imputed) for association with three phenotypes: AE, usage of LD-based hotspots known to be enriched for the 13-bp motif CCNCCNTNNCCNC20, and genome-wide crossover rate (in pedigrees) (Methods and Note S4). In crossovers detected in unrelated African Americans, the alleles a person carries are only sometimes descended from the ancestor in whom the crossover occurred, thus adding noise to the association signal (nevertheless there is useful signal given the large sample size; Note S4). In the Pedigree Map, association between alleles and AE can be tested directly because we have genotypes in the parents.
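The two-map mixture view of the AE phenotype can be sketched as a one-parameter estimation problem. The Python below uses expectation-maximization rather than the MCMC described in the Methods, and assumes a hypothetical input: for each of an individual's crossovers, the likelihood of its location under the AE Map and under the S Map. The function name and the fixed iteration count are illustrative choices.

```python
from typing import List


def ae_phenotype(
    lik_ae: List[float], lik_s: List[float], n_iter: int = 200
) -> float:
    """Maximum-likelihood mixture proportion pi for the model
    P(crossover) = pi * lik_ae + (1 - pi) * lik_s, fitted by EM."""
    pi = 0.5
    for _ in range(n_iter):
        # E-step: probability that each crossover was drawn from the AE Map.
        resp = []
        for a, s in zip(lik_ae, lik_s):
            denom = pi * a + (1.0 - pi) * s
            if denom > 0.0:
                resp.append(pi * a / denom)
        if not resp:
            break  # no informative crossovers; keep the current value
        # M-step: the new proportion is the mean responsibility.
        pi = sum(resp) / len(resp)
    return pi


# Toy individual: three crossovers possible only under the AE Map and one
# possible only under the S Map.
print(ae_phenotype([1.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]))
```

When crossovers fall in regions where both maps have appreciable rate, the responsibilities are fractional and the estimate reflects that ambiguity, which is one source of the phenotype noise mentioned above.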
The SNP showing the strongest association with AE is rs6889665 (P=1.5×10^−246; Figure 2A, Figure S6), which has a derived allele frequency of 29% in YRI and 2% in CEU, and is within 4 kb of the ZF array of PRDM94,9,11,12,13. This SNP is associated with AE in both the pedigree individuals and the unrelated individuals (Note S4), and is also the SNP most strongly associated with usage of LD-based hotspots (P=1.8×10^−52) (Table S2). No locus outside PRDM9 is significant (P<0.01 after Bonferroni correction; Table S3). To better understand the association at rs6889665, we inferred the alleles in the PRDM9 ZF array carried by 139 individuals based on sequencing data from the 1000 Genomes Project21, using the reads to infer each individual’s PRDM9 alleles among 29 alleles whose full sequences were previously determined9 (Note S5). Grouping PRDM9 alleles based on how closely their binding target predictions match the 13-bp motif, following Berg et al.9, we find that the ancestral “T” variant at rs6889665 is strongly correlated to “8/8 matches” to the 13-bp motif (including the “A” and “B” alleles), while the derived “C” variant is almost perfectly correlated to a group of “5/8 match” alleles, all predicted to bind a common, different, 17-bp motif “CCgCNgtNNNCgtNNCC”9. This implies a common historical origin for alleles matching this 17-bp motif (Figure 2B; Figure S7; Note S5). We also experimentally measured the number of zinc fingers in PRDM9 in 354 individuals including 166 African Americans from the pedigree study (Methods). This showed, again, that rs6889665 differentiates PRDM9 alleles into two different classes, with 96% of haplotypes carrying the ancestral allele having <14 zinc fingers, and 93% of haplotypes carrying the derived allele having ≥14 zinc fingers (Figure S7). After conditioning on rs6889665, there is no evidence that ZF length is associated with the AE phenotype.
Several SNPs near the PRDM9 ZF array show a conditional association signal that is much weaker than rs6889665, but still significant (Figure 2C; Figure S6; Note S4), with the strongest at rs10043097 (P=8.3×10^−14), upstream of the PRDM9 transcription start site. These SNPs may tag additional variation in the PRDM9 ZF array, or potentially expression levels.
To directly identify candidate African-enriched hotspot motifs, we selected 2,454 loci with a high crossover rate in the AE Map and YRI Map (>2 cM/Mb over 2 kb), and no more than half this rate in the S Map and CEU Map (this set is more powerfully enriched for higher recombination in people of African ancestry than the 2,375 above, as it includes information from the contemporary maps). We compared these to a “control set” of 7,328 candidate hotspots more active in the European than the African derived maps (Methods; Note S6). To identify sequence motifs associated with the African-enriched hotspots3,22, we identified short motifs that occurred at increased frequency in the African-enriched hotspot set (Note S6). Testing all motifs of length 5–9 bases revealed a 9-mer “CCCCAGTGA” (OR=1.79, P=2.24×10^−8, Bonferroni corrected P=0.004) which exhibited a kilobase-scale rate peak near occurrences of this motif in African derived maps, but in neither of the European derived maps (Figure S8). Further analysis revealed a strong influence of downstream flanking bases (Figure S9), and degeneracy, yielding a 17-bp consensus sequence “CCCCaGTGAGCGTtgCc” (Figure 3A; more strongly signaled bases are uppercase) with the same consensus obtained when we considered flanking sequence for only odd or even chromosomes, and whether we based the analysis on AE-S or YRI-CEU map comparisons (Note S6). The 500 best matches to this motif have a ~3-fold increase in average rate in the AA and YRI relative to the deCODE and CEU maps (Figure 3B, Figure S8). Hotspots associated with the motif occur in both unique and repetitive DNA (e.g. L1PA10/13 LINE elements; Figure S10) (Note S6). We also compared the 17-bp consensus to the binding motif predicted for “5/8 match” alleles, and found that they match almost precisely (Figure 3A; 10 of 11 bases, P=8.1×10^−6).
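The motif search described above can be illustrated with a short Python sketch of two of its ingredients: an odds ratio for motif occurrence in hotspot versus control sequences, and a Bonferroni correction over all motifs of length 5 to 9. Several choices here are assumptions rather than the authors' procedure (which is in Note S6): a motif and its reverse complement are treated as the same test, a sequence counts once however many copies it contains, and a 0.5 continuity correction is added to each cell to avoid division by zero. Under those assumptions there are 174,752 distinct motifs of length 5 to 9, and 2.24×10^−8 multiplied by that count is about 0.004, in line with the corrected value quoted above; whether the tests were counted this way is not stated here.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def contains(seq: str, motif: str) -> bool:
    """True if the motif occurs on either strand of seq."""
    return motif in seq or revcomp(motif) in seq


def odds_ratio(hot: List[str], control: List[str], motif: str) -> float:
    """Odds ratio of motif presence in hotspot vs. control sequences,
    with a 0.5 continuity correction in every cell (an assumption)."""
    a = sum(contains(s, motif) for s in hot)
    b = len(hot) - a
    c = sum(contains(s, motif) for s in control)
    d = len(control) - c
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


def n_distinct_kmers(k_min: int, k_max: int) -> int:
    """Number of k-mers for k in [k_min, k_max], counting a k-mer and its
    reverse complement once. Self-complementary k-mers exist only for
    even k, where there are 4**(k/2) of them."""
    total = 0
    for k in range(k_min, k_max + 1):
        palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
        total += (4 ** k + palindromes) // 2
    return total


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p * n_tests)


hot_seqs = ["AACCCCAGTGATT", "GGGG", "TCACTGGGG"]  # toy sequences
control_seqs = ["AAAA", "CCCC", "GGGG", "TTTT"]
print(odds_ratio(hot_seqs, control_seqs, "CCCCAGTGA"))
print(n_distinct_kmers(5, 9), bonferroni(2.24e-8, n_distinct_kmers(5, 9)))
```

A significance test such as Fisher's exact test would normally accompany the odds ratio; it is omitted here to keep the sketch dependency-free.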
How much of the African-enriched recombination pattern can be explained by PRDM9? We estimated the fraction of variation in the AE phenotype explained by rs6889665 in our pedigree data after accounting for noise in the phenotype estimation (Note S4). Over 82% of map usage variability is explained by rs6889665 genotype alone. Given there are further influential PRDM9 variants (Figure 3C), this gene may thus explain almost all differences in local rate between the West African and European populations. We next examined rates around 82 narrowly defined (<10kb) crossover sites in 7 individuals homozygous for the derived allele at rs6889665. There is no evidence of hotspots at these loci in either the deCODE or CEU Maps (Figure 3C), in contrast to events in individuals carrying the ancestral allele at rs6889665 (Figure S11). Thus, crossover positions in individuals who are homozygous for the derived allele at rs6889665 are consistent with an entirely different recombination hotspot landscape, which would imply PRDM9 control of all hotspots9. Despite the strong correlation between maps at megabase scales, there is mounting evidence that PRDM9’s influence on crossing-over may not be limited to fine scales4,11: we observe a weakly significant association of rs6889665 with the total number of crossovers genome-wide in pedigrees (P=0.04), corresponding to an average 1.3 crossovers more per meiosis per derived allele, exceeding the strongest previously known association23 at RNF212.
We have shown that PRDM9 alleles that bind a novel 17-bp motif and occur at greatly increased frequency in people of West African ancestry have led to a shift in the recombination landscape compared with people of non-African ancestry. The larger number of hotspots available to West Africans implies that at the population level, crossovers are more evenly distributed than in Europeans10, and thus the shorter extent of West African LD is not due to differences in demographic history alone (such as the lack of an out-of-Africa founder event)24. Our findings also have medical implications, as recombination errors leading to insertions or deletions are known to be associated with recombination hotspots9,22,25. Our results predict that the congenital abnormalities that have been associated with the recombination hotspots bound by PRDM9 “A” and “B” alleles will occur at a decreased rate in people of West African ancestry, whereas new diseases will arise due to recombination errors near African-enriched hotspots.
We assembled SNP array data from 29,589 unrelated people and 222 nuclear families genotyped at 490,000–910,000 SNPs from the Candidate Gene Association Resource (CARe), studies at the Children’s Hospital of Philadelphia (CHOP), the African American Breast Cancer Consortium, the African American Prostate Cancer Consortium and the African American Lung Cancer Consortium. To build a recombination map, we used HAPMIX to localize candidate crossover positions15, and implemented a Markov Chain Monte Carlo (MCMC) that used the probability distributions for the positions of the filtered crossovers to infer recombination rates for each of 1.3 million inter-SNP intervals. We also implemented a second MCMC that models each individual’s set of crossovers as a mixture of a Shared (S) Map similar to the European deCODE Map and an African-enriched (AE) Map, and then assigns each individual an “AE phenotype” corresponding to the proportion of their newly detected crossovers assigned to the AE Map. We imputed genotypes at up to three million HapMap2 SNPs8 using MaCH26, and then tested each of these SNPs for association with the AE phenotype and other recombination-related phenotypes. We identified 2,454 candidate African-enriched hotspots with increased recombination rates in the YRI vs. CEU maps, and in the AE vs. S maps, and searched for motifs enriched at these loci, thus identifying a degenerate 17-bp motif. To study the structure of PRDM9, we measured the length of the PRDM9 zinc finger array and genotyped rs6889665 in YRI, CEU and the CARe nuclear families; we also carried out imputation based on 1000 Genomes Project short read data10 to infer the alleles individuals carry, among 29 previously characterized in a sequencing study of PRDM99.
We are grateful to the participants who donated DNA samples, to David Altshuler, Jerome Buard, Kasia Bryc, Joseph Kovacs, Bernard de Massy, Gil McVean, Bogdan Pasaniuc and Sriram Sankararaman for conversations and critiques, and to Adam Auton for facilitating analysis of the 1000 Genomes Project data.
Analysis was supported by the Wellcome Trust and NIH grants HL084107 and GM091332.
CARe was supported by a contract from the National Heart, Lung and Blood Institute (HHSN268200960009C) to create a phenotype and genotype database for dissemination to the biomedical research community. Eight parent studies contributed phenotypic data and DNA samples through the Broad Institute (N01-HC-65226): the ARIC, CFS, CARDIA, JHS, and MESA studies, as well as the Cardiovascular Health Study (CHS), the Framingham Heart Study (FHS), and the Sleep Heart Health Study (SHHS). Support for CARe also came from the individual research institutions, investigators, field staff and study participants. Individual funding information is available at http://public.nhlbi.nih.gov/GeneticsGenomics/home/care.aspx.
All genome-wide genotyping of the CHOP dataset was supported by an Institutional Development Award to the Center for Applied Genomics from the Children's Hospital of Philadelphia, a research award from the Landenberger Foundation and the Cotswold Foundation. We thank all study participants and the staff at the Center for Applied Genomics for performing the genotyping.
AABCC was supported by a DoD Breast Cancer Research Program Era of Hope Scholar Award to CAH and the Norris Foundation, and by grants to the component studies: MEC (CA63464, CA54281); CARE (HD33175); WCHS (CA100598, DAMD 170100334, Breast Cancer Research Foundation); SFBC (CA77305, DAMD 17966071); CBCS (CA58223, ES10126), PLCO (NCI Intramural Research Program); NHBS (CA100374), WFBC (R01-CA73629) and CPS-II (the American Cancer Society).
AAPCC was supported by grants CA63464, CA54281, CA1326792, CA148085 and HG004726, and by grants to the component studies: PLCO (NCI Intramural Research Program), LAAPC (Cancer Research Fund 99-00524V-10258), both MEC and LAAPC (PC35139, DP000807); MDA (CA68578, CA140388, ES007784, DAMD W81XWH0710645); GECAP (ES011126); CaP Genes (CA88164); IPCG (W81XWH0710122); DCPC (GM08016, DAMD W81XWH0710203, DAMD W81XWH0610066); SCCS (CA092447, CA68485).
AALCC was supported by grants CA060691, CA87895, PC35145 and CA22453, CA68578, CA140388, ES007784, ES06717, CA55769, CA127219, CA1116460S1, CA1116460, CA121197, CA141716, CA121197S2, CPRIT RP100443, CA148127, DAMD W81XWH0710645, University Cancer Foundation, Duncan Family institute, Center for Community, Implementation, and Dissemination Research Core, and by grants to the component studies: PLCO and the Maryland Studies (NCI Intramural Research Program), LAAPC (Cancer Research Fund 99-00524V-10258), and both MEC and LAAPC (PC35139, DP000807).
AUTHOR CONTRIBUTIONS
DR and SRM conceived the study. AGH, AT, NP, YS, NR, CDP, GKC, KW, SGB, DR and SRM performed analyses. NR performed the experimental work (genotyping of polymorphisms at PRDM9). AGH, NP, JNH, BEH, HAT, ALP, HH, SJC, CAH, JGW, DR and SRM coordinated the study. AGH, DR and SRM wrote the paper. NR, CDP, GKC, KW, SGB, SR, JNH, BEH, HAT, HH, CJC, CAH, JGW, DR and all the alphabetically listed authors contributed to sample collection and generation of SNP array data. All authors contributed to revision and review of the manuscript.
Crossover rate estimates for the AA Map, Pedigree Map, AE Map and S Map can be found at http://www.well.ox.ac.uk/~anjali/AAmap/. We also provide estimates of uncertainty for each map based on samples from the Markov Chain Monte Carlo. Association testing results for each SNP are available from the authors on request.