|Home | About | Journals | Submit | Contact Us | Français|
Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
Physical maps of the human genome, including the sequence of most of its euchromatic portions1,2, are basic resources in human genetics and genomics research: they provide the framework for analysis of sequence data; and they enable genome-scale analysis of single nucleotide polymorphisms (SNPs), copy number variants (CNVs), epigenetic phenomena, and gene expression.
Yet physical maps of the human genome remain incomplete. Almost 30 million base pairs (Mbp) of euchromatic genome sequence that are apparently human – observed in human whole-genome sequence data3,4, containing human expressed sequence tags5,6 (ESTs), and homologous to other mammalian genome sequences – are either absent from, or have no assigned locations in, current assemblies of the human genome7,8.
These “missing pieces” of the reference human genome are a likely source of mistaken inference in today's analyses of genome sequence data9. Sequence reads arising from the missing pieces may be discarded as non-alignable, or incorrectly assumed to arise from paralogous sequences in the known, assembled part of the human genome. Sequences missing from the reference human genome might also help answer questions in human genetics research, such as the source of the genetic signals that have been ascertained (but not yet fine-mapped to causal variation or causal genes) by linkage, association, and CNVs.
Here we describe an approach for “admixture mapping” the human genome's missing pieces at megabase pair scales, by utilizing the patterns of sequence variation that have been created by isolation and subsequent re-mixture of human populations. We report the successful mapping of ~5Mbp of unplaced human euchromatic sequences, including many protein-coding genes. We find that most of these sequences are euchromatic islands within the genome's heterochromatic oceans, including centromeres and the short arms of the acrocentric chromosomes, and that they almost always consist of segmental duplications (sometimes recent, sometimes millions of years old) of sequence present elsewhere in the reference genome.
The construction of large-scale genome models (“assemblies”) utilizes physical sequence overlaps between genomic clones10. Clones are assembled, into larger scaffolds based on overlapping sequences at their ends.
By contrast, mapping based on statistical relationships among variants can provide information that is complementary to physical mapping, as it does not require a continuous path of sequences to be cloned and uniquely assembled. Before physical mapping was feasible, linkage among alleles was used to construct the first genetic maps of the human genome based on restriction fragment length polymorphisms11,12, and later to build and improve genetic maps based on microsatellite markers13,14.
A unique kind of long-range information – finer in resolution than linkage in families, yet longer in reach than linkage disequilibrium (LD) in populations – is present in many of the world's admixed populations. Whenever human populations have been reproductively isolated for long periods of time (such as Africans and Europeans) and then re-mixed (such as among African Americans), the genomes of the descendants are mosaics of segments that derive from ancestors from the two ancestral populations (Fig. 1a). The divergence of the sequences in the ancestral populations gives rise to sequence variation that is informative about the ancestry of each segment. Long-range “admixture LD” has been used to map genetic factors that segregate at different frequencies in different populations15,16 and to identify genomic sites of recombination in African Americans17,18.
We reasoned that population admixture could also be used to map the locations of unmapped human genome sequences. Provided that the sequence in a genomic missing piece is variable, that this variation was subject to genetic drift, and that the extent of this drift is known in the two ancestral populations, we could infer the ancestral origin of a missing piece – whether it has been inherited from each individual's European or African ancestors – with varying levels of statistical certainty, in a large panel of admixed individuals. By comparing such ancestry profiles for the genome's missing pieces to similar determinations across the known/mapped/assembled majority of these individuals’ genomes, each missing piece could in principle be connected to the genomic location at which it resides, even if we lack a continuous path of cloned, assembled sequence with which to make such a connection (Fig. 1b).
Specifically, we can test ancestry-informative SNPs for correlation between their genotypes and inferred local ancestry across the genome, estimated using available genome-wide genotypes19. This is different from, and potentially much more powerful than, detecting LD between genotypes at two SNPs, since the correlation between genotypes and local ancestry is expected to be much stronger (than that between SNPs) at genetic distances up to a few cM, and the distance between unmapped “missing pieces” and the nearest parts of the reference genome may be substantial. Furthermore, we estimated statistical mapping power from allele frequencies in the ancestral populations and found that it was substantial, even for admixed population samples of even a few hundred individuals (Supplementary Note and Supplementary Figs. 1–3). Thus, admixture mapping could in principle connect sequences that are physically farther apart than most genomic clones (20-180kbp) and LD blocks (15-50kbp).
We used three sources of unplaced genome sequence: (i) the current reference genome (hg19), which contains 59 unplaced contigs (~5Mbp of euchromatic sequence) for which the correct location is either only known at the chromosomal level or not known at all; (ii) the HuRef genome20, assembled by random shotgun sequencing of a single individual, containing an even larger number of unplaced scaffolds (~3.5Mbp of euchromatic sequence in 28 scaffolds >100kbp and ~7Mbp of euchromatic sequence in 698 scaffolds >10kbp); and (iii) sequence from bacterial artificial chromosome (BAC) and fosmid clones available from GenBank21 (Online Methods).
If an ancestry-informative SNP resides on an unmapped contig, we can map the location of the contig by admixture mapping the SNP. We (i) aligned all unmapped sequence reads from the 1000 Genomes Project22,23 to unplaced scaffolds from HuRef, (ii) identified polymorphic sites across these unplaced sequences, and (iii) computed genotypes at each locus in all European (CEU) and West African (YRI) samples (Online Methods). We selected 314 ancestry-informative SNPs whose genotypes had Pearson correlation r2>15% with local ancestry. We then genotyped these SNPs in a cohort of 380 African American participants from the Jackson Heart Study24 (JHS), selecting this sample size based on initial analyses of the predicted power to map each SNP as function of the number of available genotypes (Online Methods and Supplementary Fig. 3).
We successfully admixture-mapped 139 SNPs (Supplementary Fig. 4 and Supplementary Table 1), assigning locations for 70 previously unlocalized scaffolds (Fig. 2 and Supplementary Table 2). We never observed SNPs from the same scaffold mapping to different locations, as could be the case if the scaffold were itself misassembled. Sequences mapped by this approach comprised a total of ~4Mbp of euchromatic sequence that had not been included or mapped in hg19.
We describe the properties of these mapped locations below.
An additional set of cryptic missing pieces might be missing entirely from human genome reference sequences (i.e. might not even be described as unlocalized sequences, nor present in HuRef) but exist instead as cryptic segmental duplications (or paralogs) of known genomic sequences and have been incorrectly assumed to represent the same genomic sequence as their known paralogs.
We reasoned that admixture mapping could also be used to identify cryptic segmental duplications. A SNP that is annotated in the assembled part of the human genome might in fact exist on a cryptic paralogous sequence elsewhere. Therefore, identification of SNPs that admixture-map to a different genomic location than their annotated location might indicate the presence of these SNPs at another genomic location on a cryptic segmental duplication.
To identify mismapped SNPs we analyzed genome-wide SNP data from two large African American cohorts. Among the 906,703 SNPs from the Affymetrix 6.0 array genotyped in ~7,800 individuals from the Candidate gene Association Resource (CARe) cohort25 and the 566,714 SNPs from the Illumina HumanHap550 array genotyped in ~1,800 individuals from the ICDB cohort, we identified, respectively, 121 and 15 SNPs that admixture-mapped to genomic locations far from their HapMap26 annotations of physical location (Supplementary Note and Supplementary Table 3).
Approximately half these mismapped SNPs belonged to a single region, a ~360kbp segmental duplication from 16q22.2 to 1q21.1 involving the HYDIN gene27–29, confirmed by fluorescence in situ hybridization (FISH), and previously found to give rise to false genome-wide association signals at 16q22.2, that in fact arise from true association at the Duffy locus at 1q23.230 (Supplementary Tables 4 and 5).
Excluding the HYDIN paralog, incorrect mapping for ~30 SNPs can be explained by known segmental duplications31–37, while for the remaining ~40 mismapped SNPs, the most likely explanation is that they reside on sequence missing from the reference genome. (Among the ~30 SNPs that we simply re-mapped from one known segmental duplication copy to another, ten corresponded to sites previously used as single unique nucleotides38 (SUNs) to distinguish among known segmental duplications. By definition, none of the re-mapped SNPs with which we identified novel segmental duplications corresponded to SUNs.)
To understand the relationships between these cryptic paralogs and unplaced scaffolds from large sequencing efforts, we cross-referenced the locations of these SNPs with alignments of unlocalized sequence from HuRef and GenBank. We identified 18 sequences >40kbp each containing one or more of the mismapped SNPs. Twelve of these 18 regions (spanning ~1.3Mbp of euchromatic sequence) could not be explained by segmental duplications already annotated in the reference genome; these indicate the presence of cryptic segmental duplications.
To critically evaluate these findings by an independent method, we utilized the principle that cryptic segmental duplications should give rise, for SNPs called from sequencing data, to excess heterozygosity that does not follow simple models of Hardy Weinberg equilibrium between pairs of alleles. We searched for such a signal – annotated SNPs that behave more like paralogous sequence variants (PSVs) – in data from the 1000 Genomes Project pilot, and were able to confirm all of these regions (Online Methods and Supplementary Table 6). For 8 of these 12 cryptic segmental duplications we could find no mention in the literature. We further confirmed six of them by inter-chromosomal LD analysis using HapMap genotypes (Table 1).
We determined for each region whether the alternate allele for any of the mismapped SNPs was present in any of the BAC clones aligning to that region, by aligning sequences from BAC clones retrieved from GenBank to the hg19 reference genome. For SNPs in six of these regions we could identify BAC clones carrying the alternate allele, suggesting that these clones harbor the sequence where these SNPs actually reside (Table 1). For one of these regions containing the gene PRIM2, further analysis indicated an intra-chromosomal duplication in the pericentromeric region of chromosome 6 and an additional inter-chromosomal duplication in the pericentromeric region of chromosome 3 (Supplementary Note). We confirmed the existence of this triplication by the existence of excess sequence read depth across this region in low-coverage data from the 1000 Genomes Project (Fig. 3a and Supplementary Fig. 5) and fluorescence in situ hybridization (FISH) analysis (Fig. 3b). We also observed that the copy in the reference genome is a hybrid of the two copies on chromosome 6 due to a misassembly (Supplementary Note and Supplementary Fig. 6).
Despite the fact that most of the 300 or so gaps8 in the reference human genome exist in interstitial regions, most of the sequence we were able to localize mapped not to interstitial gaps but to cytogenetically-defined heterochromatic regions of the human genome. Among the mapped scaffolds, 57 of 70 mapped to pericentromeric regions (Fig. 2 and Supplementary Table 2). Among the re-mapped SNPs identifying cryptic segmental duplications, 40 of 70 mapped to pericentromeric regions. (In all these cases the resolution of the mapping was limited to the pericentromeric region identified.)
We sought to confirm these pericentromeric mappings using both published and new cytogenetic data. Among the 70 scaffolds we mapped successfully, 17 were among 29 scaffolds that were previously analyzed by FISH (Supplementary Information of 39 and Supplementary Table S8 of 20). All 17 of these admixture-based mappings were consistent with one of the often multiple locations suggested by FISH (Fig. 2 and Supplementary Table 2). While confirmatory, this result also emphasizes the discerning power of admixture mapping over techniques based on hybridization, as the latter can yield ambiguous results when clones contain segmental duplications or other kinds of repeats.
We also performed additional FISH experiments to critically evaluate the mappings of five novel cryptic paralogous sequences for which no previous FISH data existed. In all (5/5) cases, FISH confirmed the presence of the additional copy in the predicted pericentromeric region (Fig. 4 and Online Methods).
A further prediction of these mappings to pericentromeric regions involves the sequence content of the respective scaffolds. If these genomic missing pieces are indeed euchromatic islands within heterochromatic oceans, then they might frequently contain heterochromatic beaches consisting of the satellite sequences associated with human centromeres. To evaluate this prediction, we measured the amount of sequence classified as heterochromatic satellite on each scaffold. The great majority (50/57) of the scaffolds that admixture-mapped to pericentromeric regions contained more than 5% satellite sequence (Online Methods, Supplementary Fig. 4, and Supplementary Table 2), compared with almost none (1/13) of the scaffolds that admixture-mapped to interstitial regions (p=0.003).
Another prediction of these pericentromeric mappings is that, given earlier data indicating that recombination within centromeres is likely to be heavily repressed40, scaffolds mapping to the same pericentromeric regions might show LD with one another. We identified pairs of SNPs (from distinct scaffolds) with significant linkage disequilibrium not due to admixture, and were able to identify ~500 SNPs from distinct scaffolds mapping to same genomic regions (Supplementary Table 7). In no instance did these LD-based relationships among scaffolds disagree with our mappings from admixture.
To understand how the pericentromeric missing pieces relate to the known human genome, we aligned their sequences to hg19; virtually all scaffolds mapping to pericentromeric regions were found to consist of one or more segmental duplications of mapped euchromatic sequence, with 2-5% sequence divergence (Supplementary Table 2). This suggests that a large fraction of these sequences arrived at their current locations by a process of segmental duplication in primate ancestors41.
Our mapping of these cryptic segmental duplications to centromeric regions is consistent with an earlier finding that most chromosome arms (35 of 43) have an increasing number of known interchromosomal duplications in the proximity of the centromeres42; both results appear to reflect a tendency of interchromosomal duplications to deposit sequence at and around centromeres.
Although the cryptic, pericentromeric euchromatic regions described here have not been purposefully interrogated in earlier CNV studies, they may have been indirectly interrogated via assays that targeted paralogous sequences in the known, assembled parts of the human genome. This seems the likely scenario, as almost all of the mismapped SNPs we identified from genotyping arrays (63 of 70, not including the HYDIN locus) fall within CNVs reported in the Database of Genomic Variants (DGV)43 (Supplementary Table 3), despite the fact that DGV CNVs together cover less than a third of the human genome.
Given the sequence divergence over the identified cryptic paralogs (often greater than 2%), these additional copies are likely to have fixed in the ancestors of all humans. Identifying CNVs over these sequences at a greater rate than for the rest of the genome may therefore indicate the instability of sequences in pericentromeric regions rather than a persistent state of polymorphism of these additional copies in the human population after the duplication event. To evaluate CNV of four selected paralogous region-pairs, we analyzed read depth of coverage and paralogous sequence variation using data from the 1000 Genomes Project (Online Methods). We identified common CNVs affecting the segmental duplications from the 2p22.2, 4q35.2, and DUSP22 loci (Supplementary Figs. 7–9), and we found evidence for CNVs affecting either of the PRIM2 cryptic paralogs (Fig. 3a and Supplementary Fig. 5). In each case we could confirm using PSVs that the cryptic paralogs, rather than the paralogs present in the reference genome, account for the observed copy number variation (Supplementary Note), consistent with CNVs having arisen in the pericentromeric paralogs.
Cryptic, pericentromeric paralogs of known protein-coding genes could in principle be either pseudogenes or expressed, intact genes. To test whether cryptic paralogs of coding genes are expressed at an RNA level, we analyzed RNA-seq data from the Human BodyMap 2.0 project. We focused on reads aligning to the genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, all of which appear to have cryptic paralogs in pericentromeric regions (Fig. 3 and Supplementary Figs. 5, 9, and 10). To distinguish RNA arising from reference gene copies from RNA arising from the cryptic paralogs, we focused on reads covering PSVs identifiable from genomic DNA sequence (many of which were previously mis-annotated as SNPs); this makes it extremely likely that sequence differences observed in RNA have a genomic origin (Fig. 5 and Online Methods). We identified expressed RNA for all of the paralogs except MAP2K3 (Fig. 5).
The expression of cryptic, pericentromeric gene copies showed several kinds of relationship to expression of their paralogs. Both DUSP22 and its recently duplicated paralog were expressed and exhibited similar distribution across tissues. In contrast, the cryptic paralogs of PRIM2, which contain only exons 6-14 of the original transcript (as shown in Fig. 3a), give rise to shorter transcripts that are expressed exclusively in brain and testes (Fig. 5). For HYDIN, which is expressed in brain and several other tissues, this analysis indicated that the cryptic paralog at 1q21.1 is expressed in the brain, consistent with its earlier observation in a brain cDNA library28. For KCNJ12 we could detect expression of the paralog KCNJ18 in testis tissue (Supplementary Fig. 11), though we did not detect this KCNJ12 paralog in skeletal muscle where KCNJ18 is known to be expressed at a level sufficient to cause a phenotype44. The tissue specificity observed for paralogous copies is also evidence that these observations are not the result of sequencing errors at putative PSV sites.
These results suggest that many of these cryptic, pericentromeric gene paralogs are expressed genes, and that their expression patterns can differ from those of their known paralogs.
We have described a population-based approach for helping to assemble the rest of the euchromatic human genome, even when missing pieces are separated from known euchromatic sequence by extensive heterochromatic sequence. Because our approach uses data that are widely available or are quickly becoming so, its power will increase quickly in the coming years. We anticipate that this approach will help complete physical maps of the human genome.
Analysis of ancestry-informative markers in unlocalized scaffolds can be used to map the genomic locations of these scaffolds with a physical resolution comparable to that of FISH but with unambiguous mapping to individual loci, and in a highly scalable way that will become inherently more powerful as sequence data sets grow. (Many aspects of the genome assembly will continue to require other methods – for example, our approach does not determine the physical orientation of novel sequence with respect to the chromosome.) Using this approach we mapped ~4Mbp of unplaced euchromatic sequence, most of which we found to be embedded in the heterochromatic regions of the genome. These regions are not included in the current human reference genome and, with two exceptions, they do not overlap with any of the current patches included in the latest revision (Supplementary Table 8).
One limitation of our approach is that it relies on novel sequence having been correctly assembled and distinguished from paralogous sequence. Most sequences from HuRef unplaced scaffolds have a divergence greater than 2% with their closest paralogs; due to limitations of shotgun sequencing assembly, paralogous segments with <2% sequence divergence are likely to be under-represented in human genome assemblies45. Unfortunately, due to their short read lengths, current whole-genome next-generation sequencing approaches do not provide better assemblies for such regions than those obtained with capillary-based sequencing approaches46. Nonetheless, we showed that admixture mapping the SNPs ascertained in such regions can still allow the discovery and mapping of these cryptic paralogous sequences.
Our results have several potential implications for disease gene mapping in humans, particularly wherever genetic signals map near pericentromeric regions, assembly gaps, and segmental duplications.
We showed that CNVs are more common over cryptic paralogs missing from the reference genome, most likely due to the physical instability of pericentromeric regions. We also showed that paralogous genes in these cryptic pericentromeric duplications are transcribed, sometimes with patterns of expression that diverge from those of their paralogs, and therefore potentially serving unique biological functions.
The presence of duplicated regions complicates genome assemblies, SNP and CNV discovery (Supplementary Figs. 12–24). Notably, HYDIN and PRIM2 are among the most difficult genes to reconstruct using de novo assembly from short sequence reads54. PRIM2 and KCNJ12 are among the genes with the largest number of mis-identified non-synonymous SNPs55, most likely due to identification of PSVs as SNPs.
Approximately 6% of the human genome reference is currently considered unreliable for variant discovery by the 1000 Genomes Project23, due to dearth or excess read coverage or poor alignment of sequence reads. Most of the regions we identified as harboring a cryptic segmental duplication (Table 1 and Supplementary Table 6) fall in this inaccessible part of the human genome. While waiting for a more complete version of the human genome reference, the 1000 Genomes Project now aligns sequence data to an expanded genome reference that includes additional unlocalized sequences (termed “decoy sequences”), to reduce false alignments in regions with cryptic segmental duplications. These additional sequences consist mainly of sequenced clones discarded by the Human Genome Project and sequence from the HuRef assembly (~30% of decoy sequences consist of HuRef unlocalized scaffolds). Of course, the eventual goal of such projects will be the alignment of all human sequence reads to their actual physical locations.
In completing maps of the human genome, the important remaining challenges include mapping the human genome's structure at all scales, fully cataloging the genome's sequence content, and appreciating how sequences are ordered and arranged along chromosomes. As the scientific community works toward a complete reference assembly of the human genome56, analysis of genome-wide data from admixed populations will add unique value and help complete our understanding of the human genome's structure and evolution.
URLs. HuRef unplaced scaffolds, ftp://ftp.tigr.org/pub/data/huref/
GenBank database: ftp://ftp.ncbi.nih.gov/genbank/
Database of Genotypes and Phenotypes (dbGaP): http://www.ncbi.nlm.nih.gov/gap
Illumina iControlDB: http://www.illumina.com/science/icontroldb.ilmn
HapMap inter-chromosomal LD: ftp://ftp.ncbi.nlm.nih.gov/hapmap/inter_chr_ld/
Illumina Human BodyMap 2.0 data: http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611
To align the HuRef genome and sequenced BAC and fosmid clones to the hg19 reference genome we first downloaded all available sequence from The Institute for Genomic Research and GenBank websites (by downloading the scaffold-not-in-chromosome.fasta file from ftp://ftp.tigr.org/pub/data/huref/ and all gbpri* files from ftp://ftp.ncbi.nih.gov/genbank/), and then used BWA57 (using bwa bwasw) for the alignments against hg19. We identified repeats classified as satellite sequences over HuRef unplaced scaffolds using RepeatMasker58. Satellite sequence consists of large arrays of tandemly repeated units of non-coding DNA. The amount of satellite and missing sequence is reported for each unplaced scaffold (Supplementary Fig. 4 and Supplementary Table 2). To identify within these resources the presence of cryptic segmental duplications, that is, sequence missing from the current reference genome but present in a diverged duplicated form, we aligned all available contigs from HuRef and GenBank clones against hg19 (Supplementary Table 2).
For genotyping from sequence reads, we selected all the CEU and YRI samples available in the 1000 Genomes Project22,23. Unmapped reads were aligned against the HuRef unplaced scaffolds using BWA59 (using bwa aln/sampe). Genotype calling in the unplaced contigs was performed using the Genome Analysis Toolkit60 (GATK), using default settings for the UnifiedGenotyper walker.
To map the location of a SNP, the genotypes were first adjusted by regressing for amount of global West African ancestry for each sample. The adjusted genotypes were then tested for correlation with local ancestry across the genome using a one-tailed Pearson correlation test. If the correlation of the genotypes with global West African ancestry is positive, a right-tailed test is chosen, otherwise a left-tailed test is chosen. The location corresponding to the smallest p-value is then recorded for each SNP, together with the location corresponding to the smallest p-value in a different chromosome. All these steps were performed using custom MATLAB (2011b, The MathWorks, Natick, MA) scripts.
It is intuitive to expect that genotyping SNPs over paralogous sequences, only one of which will be expected to be polymorphic, will be often incorrect as it won't be possible to correctly infer the homozygous state for the alternate allele, leading to failure of Hardy-Weinberg equilibrium among other things. This is not always so for genotyping arrays however, as genotyping of SNPs is often based on a two-dimensional Gaussian mixture model over summarized probe intensities for each of the two alleles61, enabling the correct distinction of the three possible genotypes even without modeling the presence of a cryptic paralog.
From all detected SNPs in hg19 unplaced contigs and HuRef unplaced scaffolds we filtered out SNPs at loci for which the number of reads with mapping quality 0 was at least 4 and at least 10%. We also filtered out clusters of four SNPs within a window size of 10bp. The rationale is that in loci with ambiguous alignment, it is possible to call SNPs which actually belong to a paralogous region of the genome. Variants called in loci where many SNPs cluster together have a higher chance to be an artifact of misaligned reads originating from paralogous regions that are not present in the reference genome used for the alignment. This methodology maximizes the chances that a SNP belongs to the unplaced scaffold where it is called. Among the filtered list, up to 7 ancestry-informative SNPs were chosen for each contig for which genotype was estimated to have Pearson correlation coefficient with amount of local European ancestry satisfying r2>15%. SNPs were further filtered to fit within 10 Sequenom plexes, prioritizing degree of correlation with ancestry. We selected 380 samples from the Jackson and Heart Study (JHS)24, which had been genotyped with the Affymetrix 6.0 array and analyzed with HAPMIX62. To achieve the maximum possible mapping resolution, we exclusively selected samples with at least 62 detected crossovers between ancestries (maximum was 115).
Most likely due to the repetitiveness of the flanking sequences for which primers were designed, 86 assays failed completely; of the remainder, 53 failed the Hardy-Weinberg equilibrium test (p<10-6), and 175 passed. Nevertheless, we could still reliably identify the location of 139 SNPs (Pearson correlation test p<10-6), 106 of which had passed and 33 of which had failed Hardy-Weinberg equilibrium test, showing that SNPs with unreliable genotypes can still be informative for mapping purposes (Supplementary Fig. 4 and Supplementary Table 1). By analyzing for each successfully mapped SNP the best correlation between adjusted genotype and local ancestry on chromosomes other than the one where the SNP mapped, we estimated that the selected conservative threshold of 10-6 for the p-value gives a false discovery rate lower than 1%.
To identify regions with an excess of PSVs suggesting the presence of large cryptic segmental duplications, we searched for SNPs across the reference genome whose probabilistic genotype from 1000 Genomes Project pilot low-pass sequencing data failed the Hardy-Weinberg equilibrium test63 (using bcftools view -c). We identified variants that failed the equilibrium test (p<10-6) in CEU and YRI samples, grouped them together if they were <5kbp apart (using custom MATLAB scripts), and listed all resulting regions >40kbp (Supplementary Table 6).
Peripheral blood mononuclear cells were stimulated with phytohemagglutinin and harvested. Metaphase spreads were prepared by standard protocols. Fosmid clones spanning the regions of interest were selected for FISH mapping using the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/). Fosmids were labeled with either SpectrumOrange or SpectrumGreen conjugated dUTP using a nick translation kit (Abbott Molecular, Des Plaines, IL). Labeled pairs were hybridized overnight to metaphase chromosome preparations. Following 4x SSC/0.1% Tween, 2xSSC/0.3% Tween, and phosphate-buffered detergent washes, chromosomes were counterstained with DAPI and analyzed by epi-fluorescence with a Zeiss Axioplan2 microscope (Thornwood, NY) and Applied Imaging CytoVision software (Santa Clara, CA).
To assess the copy number variability of the missing reference segments, we used an updated version of Genome STRiP64 to analyze read depth. Normalized read depth was measured by comparing the number of DNA fragments with sequencing reads aligned to the reference genome in a given region to the expected read depth per haploid copy based on (i) the total sequencing depth for each sample, (ii) the alignability of each position, based on whether it would be uniquely mapped by a perfect 36bp read, and (iii) sequencing bias due to GC content.
We performed normalization for GC-bias empirically, similar to 38. We first identified a 588Mbp subset of the autosomal reference sequence with no known evidence of copy number variation to use as a baseline. We removed all positions within 200bp of annotated CNV regions listed in DGV, segmental duplications listed in the UCSC browser, repeats annotated by RepeatMasker58 and assembly gaps, yielding a subset that is highly likely to be copy number invariant in the majority of people. This reference subset is divided into 400bp windows, stratified by GC fraction within each window, and the observed read depth at each GC fraction is compared to the total read depth across all windows to yield a GC-normalization curve for each sequencing library.
Given a genomic locus, estimation of diploid copy number for each sample was done by fitting a Gaussian mixture model with sample-specific variance to the observed and expected read depth for each sample64, allowing the model to fit as many copy number classes as needed at each locus.
To analyze genome regions with known paralogs in sequences not in the hg19 reference (notably 2p22.2), we used BWA59 (using bwa aln/sampe) to realign the 1000 Genomes reads from the genomic region to a synthetic reference containing the original reference sequence plus the sequence for the extra paralog. Estimation of copy number was then carried out as described above.
To compare expression of different paralogs of genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, we first identified PSVs over the predicted mRNA for these genes looking at all heterozygous loci called for 1000 Genomes Project pilot high coverage samples NA12878/CEU, NA12891/CEU, NA12892/CEU, NA19238/YRI, NA19239/YRI, and NA19240/YRI and then determined, when possible, which allele belongs to each paralog (Supplementary Tables 9–13). Once we obtained a list of all PSVs, we counted reads from the Illumina Human BodyMap 2.0 project for each of the alleles observed at the locus using the GATK60 (using default settings for the UnifiedGenotyper walker and custom scripts). To validate the findings and filter out possible artifacts, sequence reads were further manually analyzed using the Integrative genomics viewer65 (IGV).
This study was supported by grants RC1 GM091332-01 (S.A.M. and J.G.W.) and R01DK54931 (G.G. and M.R.P.) from the National Institutes of Health, and by a Smith Family Foundation Award for Excellence in Biomedical Research (S.A.M.).
The Jackson Heart Study is supported and conducted in collaboration with Jackson State University (N01-HC-95170), University of Mississippi Medical Center (N01-HC-95171), and Touglaoo College (N01-HC-95172) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD), with additional support from the National Institute on Biomedical Imaging and Bioengineering (NIBIB).
The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C).
The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the University of Alabama at Birmingham (N01-HC95095 & N01-HC48047), University of Minnesota (N01-HC48048), Northwestern University (N01-HC48049), and Kaiser Foundation Research Institute (N01-HC48050).
MESA, MESA Family, and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and RR-024156. Funding for MESA Family is provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258, R01-HL-071259. MESA Air is funded by the US EPA - Science to Achieve Results (STAR) Program Grant # RD831697. Funding for genotyping was provided by NHLBI Contract N02-HL-6-4278 and N01-HC-65226. This manuscript does not necessarily reflect the opinions or views of ARIC, CARDIA, JHS, MESA, or the NHLBI.
G.G. and S.A.M. conceived the project, designed the analyses and wrote the manuscript. G.G. performed the analysis of the CARe, ICDB, JHS, and BodyMap 2.0 datasets. R.E.H. performed the sequence read depth analysis of selected regions. H.L. performed the alignments of HuRef scaffolds and GenBank clones. N.A. contributed the analysis of the HuRef unplaced scaffolds. A.M.L. performed the FISH experiments. K.C. organized and contributed to the design of the Sequenom experiment. B.P., A.P., and D.R. provided advice for the local ancestry inference. C.C.M. participated in interpretation of the FISH experiments. M.R.P. participated in planning discussions for the linkage analysis. J.G.W. participated in planning discussions, coordinated interactions with the Jackson Heart Study, and edited the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.