PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Nat Genet. Author manuscript; available in PMC 2013 October 1.
Published in final edited form as:
Published online 2013 February 24. doi:  10.1038/ng.2565
PMCID: PMC3683849
NIHMSID: NIHMS474889

Using population admixture to help complete maps of the human genome

Abstract

Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces by utilizing the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning four million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified eight large novel inter-chromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed in RNA and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.

Physical maps of the human genome, including the sequence of most of its euchromatic portions1,2, are basic resources in human genetics and genomics research: they provide the framework for analysis of sequence data; and they enable genome-scale analysis of single nucleotide polymorphisms (SNPs), copy number variants (CNVs), epigenetic phenomena, and gene expression.

Yet physical maps of the human genome remain incomplete. Almost 30 million base pairs (Mbp) of euchromatic genome sequence that are apparently human – observed in human whole-genome sequence data3,4, containing human expressed sequence tags5,6 (ESTs), and homologous to other mammalian genome sequences – are either absent from, or have no assigned locations in, current assemblies of the human genome7,8.

These “missing pieces” of the reference human genome are a likely source of mistaken inference in today's analyses of genome sequence data9. Sequence reads arising from the missing pieces may be discarded as non-alignable, or incorrectly assumed to arise from paralogous sequences in the known, assembled part of the human genome. Sequences missing from the reference human genome might also help answer questions in human genetics research, such as the source of the genetic signals that have been ascertained (but not yet fine-mapped to causal variation or causal genes) by linkage, association, and CNVs.

Here we describe an approach for “admixture mapping” the human genome's missing pieces at megabase pair scales, by utilizing the patterns of sequence variation that have been created by isolation and subsequent re-mixture of human populations. We report the successful mapping of ~5Mbp of unplaced human euchromatic sequences, including many protein-coding genes. We find that most of these sequences are euchromatic islands within the genome's heterochromatic oceans, including centromeres and the short arms of the acrocentric chromosomes, and that they almost always consist of segmental duplications (sometimes recent, sometimes millions of years old) of sequence present elsewhere in the reference genome.

An approach for admixture mapping unplaced sequence

The construction of large-scale genome models (“assemblies”) utilizes physical sequence overlaps between genomic clones10. Clones are assembled, into larger scaffolds based on overlapping sequences at their ends.

By contrast, mapping based on statistical relationships among variants can provide information that is complementary to physical mapping, as it does not require a continuous path of sequences to be cloned and uniquely assembled. Before physical mapping was feasible, linkage among alleles was used to construct the first genetic maps of the human genome based on restriction fragment length polymorphisms11,12, and later to build and improve genetic maps based on microsatellite markers13,14.

A unique kind of long-range information – finer in resolution than linkage in families, yet longer in reach than linkage disequilibrium (LD) in populations – is present in many of the world's admixed populations. Whenever human populations have been reproductively isolated for long periods of time (such as Africans and Europeans) and then re-mixed (such as among African Americans), the genomes of the descendants are mosaics of segments that derive from ancestors from the two ancestral populations (Fig. 1a). The divergence of the sequences in the ancestral populations gives rise to sequence variation that is informative about the ancestry of each segment. Long-range “admixture LD” has been used to map genetic factors that segregate at different frequencies in different populations15,16 and to identify genomic sites of recombination in African Americans17,18.

Figure 1
Admixture mapping the human genome's missing pieces. (a) Chromosomes of West African descent (red) have recombined with chromosomes of European descent (blue) through admixture to form mosaic genomes in African Americans. (b) Localization of genomic “missing ...

We reasoned that population admixture could also be used to map the locations of unmapped human genome sequences. Provided that the sequence in a genomic missing piece is variable, that this variation was subject to genetic drift, and that the extent of this drift is known in the two ancestral populations, we could infer the ancestral origin of a missing piece – whether it has been inherited from each individual's European or African ancestors – with varying levels of statistical certainty, in a large panel of admixed individuals. By comparing such ancestry profiles for the genome's missing pieces to similar determinations across the known/mapped/assembled majority of these individuals’ genomes, each missing piece could in principle be connected to the genomic location at which it resides, even if we lack a continuous path of cloned, assembled sequence with which to make such a connection (Fig. 1b).

Specifically, we can test ancestry-informative SNPs for correlation between their genotypes and inferred local ancestry across the genome, estimated using available genome-wide genotypes19. This is different from, and potentially much more powerful than, detecting LD between genotypes at two SNPs, since the correlation between genotypes and local ancestry is expected to be much stronger (than that between SNPs) at genetic distances up to a few cM, and the distance between unmapped “missing pieces” and the nearest parts of the reference genome may be substantial. Furthermore, we estimated statistical mapping power from allele frequencies in the ancestral populations and found that it was substantial, even for admixed population samples of even a few hundred individuals (Supplementary Note and Supplementary Figs. 1–3). Thus, admixture mapping could in principle connect sequences that are physically farther apart than most genomic clones (20-180kbp) and LD blocks (15-50kbp).

RESULTS

Sources of the missing pieces

We used three sources of unplaced genome sequence: (i) the current reference genome (hg19), which contains 59 unplaced contigs (~5Mbp of euchromatic sequence) for which the correct location is either only known at the chromosomal level or not known at all; (ii) the HuRef genome20, assembled by random shotgun sequencing of a single individual, containing an even larger number of unplaced scaffolds (~3.5Mbp of euchromatic sequence in 28 scaffolds >100kbp and ~7Mbp of euchromatic sequence in 698 scaffolds >10kbp); and (iii) sequence from bacterial artificial chromosome (BAC) and fosmid clones available from GenBank21 (Online Methods).

Mapping the human genome's missing pieces

If an ancestry-informative SNP resides on an unmapped contig, we can map the location of the contig by admixture mapping the SNP. We (i) aligned all unmapped sequence reads from the 1000 Genomes Project22,23 to unplaced scaffolds from HuRef, (ii) identified polymorphic sites across these unplaced sequences, and (iii) computed genotypes at each locus in all European (CEU) and West African (YRI) samples (Online Methods). We selected 314 ancestry-informative SNPs whose genotypes had Pearson correlation r2>15% with local ancestry. We then genotyped these SNPs in a cohort of 380 African American participants from the Jackson Heart Study24 (JHS), selecting this sample size based on initial analyses of the predicted power to map each SNP as function of the number of available genotypes (Online Methods and Supplementary Fig. 3).

We successfully admixture-mapped 139 SNPs (Supplementary Fig. 4 and Supplementary Table 1), assigning locations for 70 previously unlocalized scaffolds (Fig. 2 and Supplementary Table 2). We never observed SNPs from the same scaffold mapping to different locations, as could be the case if the scaffold were itself misassembled. Sequences mapped by this approach comprised a total of ~4Mbp of euchromatic sequence that had not been included or mapped in hg19.

Figure 2
Approximate locations of previously unplaced genome-sequence scaffolds that were mapped by our approach. Contigs from hg19 are labeled with three digits and stand for GL000###, while scaffolds from HuRef are labeled with five digits and stand for SCAF_11032791#####. ...

We describe the properties of these mapped locations below.

Identifying additional, cryptic missing pieces

An additional set of cryptic missing pieces might be missing entirely from human genome reference sequences (i.e. might not even be described as unlocalized sequences, nor present in HuRef) but exist instead as cryptic segmental duplications (or paralogs) of known genomic sequences and have been incorrectly assumed to represent the same genomic sequence as their known paralogs.

We reasoned that admixture mapping could also be used to identify cryptic segmental duplications. A SNP that is annotated in the assembled part of the human genome might in fact exist on a cryptic paralogous sequence elsewhere. Therefore, identification of SNPs that admixture-map to a different genomic location than their annotated location might indicate the presence of these SNPs at another genomic location on a cryptic segmental duplication.

To identify mismapped SNPs we analyzed genome-wide SNP data from two large African American cohorts. Among the 906,703 SNPs from the Affymetrix 6.0 array genotyped in ~7,800 individuals from the Candidate gene Association Resource (CARe) cohort25 and the 566,714 SNPs from the Illumina HumanHap550 array genotyped in ~1,800 individuals from the ICDB cohort, we identified, respectively, 121 and 15 SNPs that admixture-mapped to genomic locations far from their HapMap26 annotations of physical location (Supplementary Note and Supplementary Table 3).

Approximately half these mismapped SNPs belonged to a single region, a ~360kbp segmental duplication from 16q22.2 to 1q21.1 involving the HYDIN gene2729, confirmed by fluorescence in situ hybridization (FISH), and previously found to give rise to false genome-wide association signals at 16q22.2, that in fact arise from true association at the Duffy locus at 1q23.230 (Supplementary Tables 4 and 5).

Excluding the HYDIN paralog, incorrect mapping for ~30 SNPs can be explained by known segmental duplications3137, while for the remaining ~40 mismapped SNPs, the most likely explanation is that they reside on sequence missing from the reference genome. (Among the ~30 SNPs that we simply re-mapped from one known segmental duplication copy to another, ten corresponded to sites previously used as single unique nucleotides38 (SUNs) to distinguish among known segmental duplications. By definition, none of the re-mapped SNPs with which we identified novel segmental duplications corresponded to SUNs.)

To understand the relationships between these cryptic paralogs and unplaced scaffolds from large sequencing efforts, we cross-referenced the locations of these SNPs with alignments of unlocalized sequence from HuRef and GenBank. We identified 18 sequences >40kbp each containing one or more of the mismapped SNPs. Twelve of these 18 regions (spanning ~1.3Mbp of euchromatic sequence) could not be explained by segmental duplications already annotated in the reference genome; these indicate the presence of cryptic segmental duplications.

To critically evaluate these findings by an independent method, we utilized the principle that cryptic segmental duplications should give rise, for SNPs called from sequencing data, to excess heterozygosity that does not follow simple models of Hardy Weinberg equilibrium between pairs of alleles. We searched for such a signal – annotated SNPs that behave more like paralogous sequence variants (PSVs) – in data from the 1000 Genomes Project pilot, and were able to confirm all of these regions (Online Methods and Supplementary Table 6). For 8 of these 12 cryptic segmental duplications we could find no mention in the literature. We further confirmed six of them by inter-chromosomal LD analysis using HapMap genotypes (Table 1).

Table 1
Segmental duplications localized by admixture mapping.

We determined for each region whether the alternate allele for any of the mismapped SNPs was present in any of the BAC clones aligning to that region, by aligning sequences from BAC clones retrieved from GenBank to the hg19 reference genome. For SNPs in six of these regions we could identify BAC clones carrying the alternate allele, suggesting that these clones harbor the sequence where these SNPs actually reside (Table 1). For one of these regions containing the gene PRIM2, further analysis indicated an intra-chromosomal duplication in the pericentromeric region of chromosome 6 and an additional inter-chromosomal duplication in the pericentromeric region of chromosome 3 (Supplementary Note). We confirmed the existence of this triplication by the existence of excess sequence read depth across this region in low-coverage data from the 1000 Genomes Project (Fig. 3a and Supplementary Fig. 5) and fluorescence in situ hybridization (FISH) analysis (Fig. 3b). We also observed that the copy in the reference genome is a hybrid of the two copies on chromosome 6 due to a misassembly (Supplementary Note and Supplementary Fig. 6).

Figure 3
Cryptic paralogs of the PRIM2 gene. (a) Analysis of sequencing coverage depth in data from the 1000 Genomes Project suggests the presence of three segments (blue arrows) with elevated copy number. While the copy number of each segment appears to be fixed ...

Pericentromeric locations of the missing pieces

Despite the fact that most of the 300 or so gaps8 in the reference human genome exist in interstitial regions, most of the sequence we were able to localize mapped not to interstitial gaps but to cytogenetically-defined heterochromatic regions of the human genome. Among the mapped scaffolds, 57 of 70 mapped to pericentromeric regions (Fig. 2 and Supplementary Table 2). Among the re-mapped SNPs identifying cryptic segmental duplications, 40 of 70 mapped to pericentromeric regions. (In all these cases the resolution of the mapping was limited to the pericentromeric region identified.)

We sought to confirm these pericentromeric mappings using both published and new cytogenetic data. Among the 70 scaffolds we mapped successfully, 17 were among 29 scaffolds that were previously analyzed by FISH (Supplementary Information of 39 and Supplementary Table S8 of 20). All 17 of these admixture-based mappings were consistent with one of the often multiple locations suggested by FISH (Fig. 2 and Supplementary Table 2). While confirmatory, this result also emphasizes the discerning power of admixture mapping over techniques based on hybridization, as the latter can yield ambiguous results when clones contain segmental duplications or other kinds of repeats.

We also performed additional FISH experiments to critically evaluate the mappings of five novel cryptic paralogous sequences for which no previous FISH data existed. In all (5/5) cases, FISH confirmed the presence of the additional copy in the predicted pericentromeric region (Fig. 4 and Online Methods).

Figure 4
FISH analysis confirmed the presence of cryptic segmental duplications. (a) Fosmid clone WI2-1750D05 (G248P87673B3 aligned to chr2:133,062,362-133,104,847); (b) WI2-1656E10 (G248P83226C5 aligned to chr3:613,680-650,737) hybridized to the centromeric/acrocentric ...

A further prediction of these mappings to pericentromeric regions involves the sequence content of the respective scaffolds. If these genomic missing pieces are indeed euchromatic islands within heterochromatic oceans, then they might frequently contain heterochromatic beaches consisting of the satellite sequences associated with human centromeres. To evaluate this prediction, we measured the amount of sequence classified as heterochromatic satellite on each scaffold. The great majority (50/57) of the scaffolds that admixture-mapped to pericentromeric regions contained more than 5% satellite sequence (Online Methods, Supplementary Fig. 4, and Supplementary Table 2), compared with almost none (1/13) of the scaffolds that admixture-mapped to interstitial regions (p=0.003).

Another prediction of these pericentromeric mappings is that, given earlier data indicating that recombination within centromeres is likely to be heavily repressed40, scaffolds mapping to the same pericentromeric regions might show LD with one another. We identified pairs of SNPs (from distinct scaffolds) with significant linkage disequilibrium not due to admixture, and were able to identify ~500 SNPs from distinct scaffolds mapping to same genomic regions (Supplementary Table 7). In no instance did these LD-based relationships among scaffolds disagree with our mappings from admixture.

To understand how the pericentromeric missing pieces relate to the known human genome, we aligned their sequences to hg19; virtually all scaffolds mapping to pericentromeric regions were found to consist of one or more segmental duplications of mapped euchromatic sequence, with 2-5% sequence divergence (Supplementary Table 2). This suggests that a large fraction of these sequences arrived at their current locations by a process of segmental duplication in primate ancestors41.

Our mapping of these cryptic segmental duplications to centromeric regions is consistent with an earlier finding that most chromosome arms (35 of 43) have an increasing number of known interchromosomal duplications in the proximity of the centromeres42; both results appear to reflect a tendency of interchromosomal duplications to deposit sequence at and around centromeres.

Are the missing pieces copy number variable?

Although the cryptic, pericentromeric euchromatic regions described here have not been purposefully interrogated in earlier CNV studies, they may have been indirectly interrogated via assays that targeted paralogous sequences in the known, assembled parts of the human genome. This seems the likely scenario, as almost all of the mismapped SNPs we identified from genotyping arrays (63 of 70, not including the HYDIN locus) fall within CNVs reported in the Database of Genomic Variants (DGV)43 (Supplementary Table 3), despite the fact that DGV CNVs together cover less than a third of the human genome.

Given the sequence divergence over the identified cryptic paralogs (often greater than 2%), these additional copies are likely to have fixed in the ancestors of all humans. Identifying CNVs over these sequences at a greater rate than for the rest of the genome may therefore indicate the instability of sequences in pericentromeric regions rather than a persistent state of polymorphism of these additional copies in the human population after the duplication event. To evaluate CNV of four selected paralogous region-pairs, we analyzed read depth of coverage and paralogous sequence variation using data from the 1000 Genomes Project (Online Methods). We identified common CNVs affecting the segmental duplications from the 2p22.2, 4q35.2, and DUSP22 loci (Supplementary Figs. 7–9), and we found evidence for CNVs affecting either of the PRIM2 cryptic paralogs (Fig. 3a and Supplementary Fig. 5). In each case we could confirm using PSVs that the cryptic paralogs, rather than the paralogs present in the reference genome, account for the observed copy number variation (Supplementary Note), consistent with CNVs having arisen in the pericentromeric paralogs.

Expression of protein-coding genes from pericentromeric regions

Cryptic, pericentromeric paralogs of known protein-coding genes could in principle be either pseudogenes or expressed, intact genes. To test whether cryptic paralogs of coding genes are expressed at an RNA level, we analyzed RNA-seq data from the Human BodyMap 2.0 project. We focused on reads aligning to the genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, all of which appear to have cryptic paralogs in pericentromeric regions (Fig. 3 and Supplementary Figs. 5, 9, and 10). To distinguish RNA arising from reference gene copies from RNA arising from the cryptic paralogs, we focused on reads covering PSVs identifiable from genomic DNA sequence (many of which were previously mis-annotated as SNPs); this makes it extremely likely that sequence differences observed in RNA have a genomic origin (Fig. 5 and Online Methods). We identified expressed RNA for all of the paralogs except MAP2K3 (Fig. 5).

Figure 5
Expression of cryptic gene paralogs from pericentromeric regions of the human genome. Paralogous sequence variants were used to distinguish expression of the DUSP22, PRIM2, HYDIN, and MAP2K3 genes and from expression of thier cryptic paralogs, in RNA-seq ...

The expression of cryptic, pericentromeric gene copies showed several kinds of relationship to expression of their paralogs. Both DUSP22 and its recently duplicated paralog were expressed and exhibited similar distribution across tissues. In contrast, the cryptic paralogs of PRIM2, which contain only exons 6-14 of the original transcript (as shown in Fig. 3a), give rise to shorter transcripts that are expressed exclusively in brain and testes (Fig. 5). For HYDIN, which is expressed in brain and several other tissues, this analysis indicated that the cryptic paralog at 1q21.1 is expressed in the brain, consistent with its earlier observation in a brain cDNA library28. For KCNJ12 we could detect expression of the paralog KCNJ18 in testis tissue (Supplementary Fig. 11), though we did not detect this KCNJ12 paralog in skeletal muscle where KCNJ18 is known to be expressed at a level sufficient to cause a phenotype44. The tissue specificity observed for paralogous copies is also evidence that these observations are not the result of sequencing errors at putative PSV sites.

These results suggest that many of these cryptic, pericentromeric gene paralogs are expressed genes, and that their expression patterns can differ from those of their known paralogs.

DISCUSSION

We have described a population-based approach for helping to assemble the rest of the euchromatic human genome, even when missing pieces are separated from known euchromatic sequence by extensive heterochromatic sequence. Because our approach uses data that are widely available or are quickly becoming so, its power will increase quickly in the coming years. We anticipate that this approach will help complete physical maps of the human genome.

Analysis of ancestry-informative markers in unlocalized scaffolds can be used to map the genomic locations of these scaffolds with a physical resolution comparable to that of FISH but with unambiguous mapping to individual loci, and in a highly scalable way that will become inherently more powerful as sequence data sets grow. (Many aspects of the genome assembly will continue to require other methods – for example, our approach does not determine the physical orientation of novel sequence with respect to the chromosome.) Using this approach we mapped ~4Mbp of unplaced euchromatic sequence, most of which we found to be embedded in the heterochromatic regions of the genome. These regions are not included in the current human reference genome and, with two exceptions, they do not overlap with any of the current patches included in the latest revision (Supplementary Table 8).

One limitation of our approach is that it relies on novel sequence having been correctly assembled and distinguished from paralogous sequence. Most sequences from HuRef unplaced scaffolds have a divergence greater than 2% with their closest paralogs; due to limitations of shotgun sequencing assembly, paralogous segments with <2% sequence divergence are likely to be under-represented in human genome assemblies45. Unfortunately, due to their short read lengths, current whole-genome next-generation sequencing approaches do not provide better assemblies for such regions than those obtained with capillary-based sequencing approaches46. Nonetheless, we showed that admixture mapping the SNPs ascertained in such regions can still allow the discovery and mapping of these cryptic paralogous sequences.

Our results have several potential implications for disease gene mapping in humans, particularly wherever genetic signals map near pericentromeric regions, assembly gaps, and segmental duplications.

  • Copy number variations (CNVs) frequently straddle or are flanked by ambiguous regions of the genome assembly. For example, deletions and duplications at 1q21.1 reported to affect ~1.5Mbp of genomic sequence associate with cardiac developmental defects47, schizophrenia48,49, mental retardation, autism, congenital anomalies50, and abnormal head size51. Fully defining the gene content of these CNVs will require interrogating the missing sequence hidden in the assembly gaps at 1q21.1.
  • Some regions implicated in genome-wide association studies may require re-analysis in light of the results here. For example, human height associates with rs17511102 and other markers in a lincRNA-containing segment of 2p22.252 for which we found a cryptic segmental duplication (and paralogous lincRNA) in the pericentromeric region of chromosome 22. Following up this association will require that markers throughout the region be re-assigned to the correct paralogous gene copies.
  • Gene SERPINB6 was associated with a clinical phenotype through homozygosity mapping by the identification of an homozygous region terminated by the heterozygous genotype of SNP rs776281153, which our results suggest being incorrectly assigned to 6p25.3 while in fact residing at 16p11.2, leading to a slight underestimation of the correct homozygous region.
  • The genes affected by cryptic segmental duplications may be functionally important and critical to include and explicitly model in exome sequencing studies. For example mutations in KCNJ18, a gene missing from the reference genome, have been shown to cause thyrotoxic hypokalemic periodic paralysis44.
  • An admixture mapping study found that African Americans with multiple sclerosis have elevated European ancestry around the centromere of chromosome 115, a region to which our work has assigned more than a megabase of novel sequence.

We showed that CNVs are more common over cryptic paralogs missing from the reference genome, most likely due to the physical instability of pericentromeric regions. We also showed that paralogous genes in these cryptic pericentromeric duplications are transcribed, sometimes with patterns of expression that diverge from those of their paralogs, and therefore potentially serving unique biological functions.

The presence of duplicated regions complicates genome assemblies, SNP and CNV discovery (Supplementary Figs. 12–24). Notably, HYDIN and PRIM2 are among the most difficult genes to reconstruct using de novo assembly from short sequence reads54. PRIM2 and KCNJ12 are among the genes with the largest number of mis-identified non-synonymous SNPs55, most likely due to identification of PSVs as SNPs.

Approximately 6% of the human genome reference is currently considered unreliable for variant discovery by the 1000 Genomes Project23, due to dearth or excess read coverage or poor alignment of sequence reads. Most of the regions we identified as harboring a cryptic segmental duplication (Table 1 and Supplementary Table 6) fall in this inaccessible part of the human genome. While waiting for a more complete version of the human genome reference, the 1000 Genomes Project now aligns sequence data to an expanded genome reference that includes additional unlocalized sequences (termed “decoy sequences”), to reduce false alignments in regions with cryptic segmental duplications. These additional sequences consist mainly of sequenced clones discarded by the Human Genome Project and sequence from the HuRef assembly (~30% of decoy sequences consist of HuRef unlocalized scaffolds). Of course, the eventual goal of such projects will be the alignment of all human sequence reads to their actual physical locations.

In completing maps of the human genome, the important remaining challenges include mapping the human genome's structure at all scales, fully cataloging the genome's sequence content, and appreciating how sequences are ordered and arranged along chromosomes. As the scientific community works toward a complete reference assembly of the human genome56, analysis of genome-wide data from admixed populations will add unique value and help complete our understanding of the human genome's structure and evolution.

URLs. HuRef unplaced scaffolds, ftp://ftp.tigr.org/pub/data/huref/

GenBank database: ftp://ftp.ncbi.nih.gov/genbank/

Database of Genotypes and Phenotypes (dbGaP): http://www.ncbi.nlm.nih.gov/gap

Illumina iControlDB: http://www.illumina.com/science/icontroldb.ilmn

HapMap inter-chromosomal LD: ftp://ftp.ncbi.nlm.nih.gov/hapmap/inter_chr_ld/

Illumina Human BodyMap 2.0 data: http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611

Decoy sequences: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/

ONLINE METHODS

Alignment of HuRef genome and GenBank BAC and fosmid clones

To align the HuRef genome and sequenced BAC and fosmid clones to the hg19 reference genome we first downloaded all available sequence from The Institute for Genomic Research and GenBank websites (by downloading the scaffold-not-in-chromosome.fasta file from ftp://ftp.tigr.org/pub/data/huref/ and all gbpri* files from ftp://ftp.ncbi.nih.gov/genbank/), and then used BWA57 (using bwa bwasw) for the alignments against hg19. We identified repeats classified as satellite sequences over HuRef unplaced scaffolds using RepeatMasker58. Satellite sequence consists of large arrays of tandemly repeated units of non-coding DNA. The amount of satellite and missing sequence is reported for each unplaced scaffold (Supplementary Fig. 4 and Supplementary Table 2). To identify within these resources the presence of cryptic segmental duplications, that is, sequence missing from the current reference genome but present in a diverged duplicated form, we aligned all available contigs from HuRef and GenBank clones against hg19 (Supplementary Table 2).

Alignment and variant calls for 1000 Genomes Project data

For genotyping from sequence reads, we selected all the CEU and YRI samples available in the 1000 Genomes Project22,23. Unmapped reads were aligned against the HuRef unplaced scaffolds using BWA59 (using bwa aln/sampe). Genotype calling in the unplaced contigs was performed using the Genome Analysis Toolkit60 (GATK), using default settings for the UnifiedGenotyper walker.

Strategy for admixture mapping

To map the location of a SNP, the genotypes were first adjusted by regressing for amount of global West African ancestry for each sample. The adjusted genotypes were then tested for correlation with local ancestry across the genome using a one-tailed Pearson correlation test. If the correlation of the genotypes with global West African ancestry is positive, a right-tailed test is chosen, otherwise a left-tailed test is chosen. The location corresponding to the smallest p-value is then recorded for each SNP, together with the location corresponding to the smallest p-value in a different chromosome. All these steps were performed using custom MATLAB (2011b, The MathWorks, Natick, MA) scripts.

It is intuitive to expect that genotyping SNPs over paralogous sequences, only one of which will be expected to be polymorphic, will be often incorrect as it won't be possible to correctly infer the homozygous state for the alternate allele, leading to failure of Hardy-Weinberg equilibrium among other things. This is not always so for genotyping arrays however, as genotyping of SNPs is often based on a two-dimensional Gaussian mixture model over summarized probe intensities for each of the two alleles61, enabling the correct distinction of the three possible genotypes even without modeling the presence of a cryptic paralog.

SNP selection, sample selection, and Sequenom genotyping

From all detected SNPs in hg19 unplaced contigs and HuRef unplaced scaffolds we filtered out SNPs at loci for which the number of reads with mapping quality 0 was at least 4 and at least 10%. We also filtered out clusters of four SNPs within a window size of 10bp. The rationale is that in loci with ambiguous alignment, it is possible to call SNPs which actually belong to a paralogous region of the genome. Variants called in loci where many SNPs cluster together have a higher chance to be an artifact of misaligned reads originating from paralogous regions that are not present in the reference genome used for the alignment. This methodology maximizes the chances that a SNP belongs to the unplaced scaffold where it is called. Among the filtered list, up to 7 ancestry-informative SNPs were chosen for each contig for which genotype was estimated to have Pearson correlation coefficient with amount of local European ancestry satisfying r2>15%. SNPs were further filtered to fit within 10 Sequenom plexes, prioritizing degree of correlation with ancestry. We selected 380 samples from the Jackson and Heart Study (JHS)24, which had been genotyped with the Affymetrix 6.0 array and analyzed with HAPMIX62. To achieve the maximum possible mapping resolution, we exclusively selected samples with at least 62 detected crossovers between ancestries (maximum was 115).

Most likely due to the repetitiveness of the flanking sequences for which primers were designed, 86 assays failed completely; of the remainder, 53 failed the Hardy-Weinberg equilibrium test (p<10-6), and 175 passed. Nevertheless, we could still reliably identify the location of 139 SNPs (Pearson correlation test p<10-6), 106 of which had passed and 33 of which had failed Hardy-Weinberg equilibrium test, showing that SNPs with unreliable genotypes can still be informative for mapping purposes (Supplementary Fig. 4 and Supplementary Table 1). By analyzing for each successfully mapped SNP the best correlation between adjusted genotype and local ancestry on chromosomes other than the one where the SNP mapped, we estimated that the selected conservative threshold of 10-6 for the p-value gives a false discovery rate lower than 1%.

Analysis of cryptic paralogs from 1000 Genomes Project pilot data

To identify regions with an excess of PSVs suggesting the presence of large cryptic segmental duplications, we searched for SNPs across the reference genome whose probabilistic genotype from 1000 Genomes Project pilot low-pass sequencing data failed the Hardy-Weinberg equilibrium test63 (using bcftools view -c). We identified variants that failed the equilibrium test (p<10-6) in CEU and YRI samples, grouped them together if they were <5kbp apart (using custom MATLAB scripts), and listed all resulting regions >40kbp (Supplementary Table 6).

Fluorescence in situ hybridization

Peripheral blood mononuclear cells were stimulated with phytohemagglutinin and harvested. Metaphase spreads were prepared by standard protocols. Fosmid clones spanning the regions of interest were selected for FISH mapping using the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/). Fosmids were labeled with either SpectrumOrange or SpectrumGreen conjugated dUTP using a nick translation kit (Abbott Molecular, Des Plaines, IL). Labeled pairs were hybridized overnight to metaphase chromosome preparations. Following 4x SSC/0.1% Tween, 2xSSC/0.3% Tween, and phosphate-buffered detergent washes, chromosomes were counterstained with DAPI and analyzed by epi-fluorescence with a Zeiss Axioplan2 microscope (Thornwood, NY) and Applied Imaging CytoVision software (Santa Clara, CA).

Analysis of sequence read depth from 1000 Genomes Project data

To assess the copy number variability of the missing reference segments, we used an updated version of Genome STRiP64 to analyze read depth. Normalized read depth was measured by comparing the number of DNA fragments with sequencing reads aligned to the reference genome in a given region to the expected read depth per haploid copy based on (i) the total sequencing depth for each sample, (ii) the alignability of each position, based on whether it would be uniquely mapped by a perfect 36bp read, and (iii) sequencing bias due to GC content.

We performed normalization for GC-bias empirically, similar to 38. We first identified a 588Mbp subset of the autosomal reference sequence with no known evidence of copy number variation to use as a baseline. We removed all positions within 200bp of annotated CNV regions listed in DGV, segmental duplications listed in the UCSC browser, repeats annotated by RepeatMasker58 and assembly gaps, yielding a subset that is highly likely to be copy number invariant in the majority of people. This reference subset is divided into 400bp windows, stratified by GC fraction within each window, and the observed read depth at each GC fraction is compared to the total read depth across all windows to yield a GC-normalization curve for each sequencing library.

Given a genomic locus, estimation of diploid copy number for each sample was done by fitting a Gaussian mixture model with sample-specific variance to the observed and expected read depth for each sample64, allowing the model to fit as many copy number classes as needed at each locus.

To analyze genome regions with known paralogs in sequences not in the hg19 reference (notably 2p22.2), we used BWA59 (using bwa aln/sampe) to realign the 1000 Genomes reads from the genomic region to a synthetic reference containing the original reference sequence plus the sequence for the extra paralog. Estimation of copy number was then carried out as described above.

Analysis of RNA sequence expression data

To compare expression of different paralogs of genes DUSP22, PRIM2, HYDIN, MAP2K3, and KCNJ12, we first identified PSVs over the predicted mRNA for these genes looking at all heterozygous loci called for 1000 Genomes Project pilot high coverage samples NA12878/CEU, NA12891/CEU, NA12892/CEU, NA19238/YRI, NA19239/YRI, and NA19240/YRI and then determined, when possible, which allele belongs to each paralog (Supplementary Tables 9–13). Once we obtained a list of all PSVs, we counted reads from the Illumina Human BodyMap 2.0 project for each of the alleles observed at the locus using the GATK60 (using default settings for the UnifiedGenotyper walker and custom scripts). To validate the findings and filter out possible artifacts, sequence reads were further manually analyzed using the Integrative genomics viewer65 (IGV).

Supplementary Material

ACKNOWLEDGEMENTS

This study was supported by grants RC1 GM091332-01 (S.A.M. and J.G.W.) and R01DK54931 (G.G. and M.R.P.) from the National Institutes of Health, and by a Smith Family Foundation Award for Excellence in Biomedical Research (S.A.M.).

The Jackson Heart Study is supported and conducted in collaboration with Jackson State University (N01-HC-95170), University of Mississippi Medical Center (N01-HC-95171), and Touglaoo College (N01-HC-95172) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD), with additional support from the National Institute on Biomedical Imaging and Bioengineering (NIBIB).

The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C).

The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with the University of Alabama at Birmingham (N01-HC95095 & N01-HC48047), University of Minnesota (N01-HC48048), Northwestern University (N01-HC48049), and Kaiser Foundation Research Institute (N01-HC48050).

MESA, MESA Family, and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and RR-024156. Funding for MESA Family is provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258, R01-HL-071259. MESA Air is funded by the US EPA - Science to Achieve Results (STAR) Program Grant # RD831697. Funding for genotyping was provided by NHLBI Contract N02-HL-6-4278 and N01-HC-65226. This manuscript does not necessarily reflect the opinions or views of ARIC, CARDIA, JHS, MESA, or the NHLBI.

Footnotes

AUTHOR CONTRIBUTIONS

G.G. and S.A.M. conceived the project, designed the analyses and wrote the manuscript. G.G. performed the analysis of the CARe, ICDB, JHS, and BodyMap 2.0 datasets. R.E.H. performed the sequence read depth analysis of selected regions. H.L. performed the alignments of HuRef scaffolds and GenBank clones. N.A. contributed the analysis of the HuRef unplaced scaffolds. A.M.L. performed the FISH experiments. K.C. organized and contributed to the design of the Sequenom experiment. B.P., A.P., and D.R. provided advice for the local ancestry inference. C.C.M. participated in interpretation of the FISH experiments. M.R.P. participated in planning discussions for the linkage analysis. J.G.W. participated in planning discussions, coordinated interactions with the Jackson Heart Study, and edited the manuscript.

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

References

1. International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [PubMed]
2. Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. [PubMed]
3. Li R, et al. Building the sequence map of the human pan-genome. Nature biotechnology. 2009;28:57–63. [PubMed]
4. Kidd JM, et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nature methods. 2010;7:365–371. [PMC free article] [PubMed]
5. Kirsch S, et al. Interchromosomal segmental duplications of the pericentromeric region on the human Y chromosome. Genome Res. 2005;15:195–204. [PubMed]
6. Lyle R, et al. Islands of euchromatin-like sequence and expressed polymorphic sequences within the short arm of human chromosome 21. Genome Res. 2007;17:1690–1696. [PubMed]
7. Rubin EM, Lucas S, Richardson P, Rokhsar D, Pennacchio L. Finishing The Euchromatic Sequence Of The Human Genome. Nature. 2004;431:931–945. [PubMed]
8. Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011;470:187–197. [PubMed]
9. Pickrell JK, Gaffney DJ, Gilad Y, Pritchard JK. False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions. Bioinformatics. 2011;27:2144–2146. [PMC free article] [PubMed]
10. Eichler EE, Clark RA, She X. An assessment of the sequence gaps: Unfinished business in a finished human genome. Nature Reviews Genetics. 2004;5:345–354. [PubMed]
11. Botstein D, White RL, Skolnick M, Davis RW. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 1980;32:314–331. [PubMed]
12. Donis-Keller H, et al. A genetic linkage map of the human genome. Cell. 1987;51:319–337. [PubMed]
13. Weissenbach J, et al. A second-generation linkage map of the human genome. Nature. 1992;359:794–801. [PubMed]
14. Kong A, et al. A high-resolution recombination map of the human genome. Nature genetics. 2002;31:241–247. [PubMed]
15. Reich D, et al. A whole-genome admixture scan finds a candidate locus for multiple sclerosis susceptibility. Nature genetics. 2005;37:1113–1118. [PubMed]
16. Winkler CA, Nelson GW, Smith MW. Admixture Mapping Comes of Age*. Annual review of genomics and human genetics. 2010;11:65–89. [PubMed]
17. Hinch AG, et al. The landscape of recombination in African Americans. Nature. 2011;476:170–175. [PMC free article] [PubMed]
18. Wegmann D, et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nature Genetics. 2011;43:847–853. [PubMed]
19. Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nat. Rev. Genet. 2011;12:523–528. [PMC free article] [PubMed]
20. Levy S, et al. The diploid genome sequence of an individual human. PLoS biology. 2007;5:e254. [PMC free article] [PubMed]
21. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2011;39:D32–37. [PMC free article] [PubMed]
22. The 1000 Genomes Project Consortium A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
23. The 1000 Genomes Project Consortium An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. [PMC free article] [PubMed]
24. Taylor HA, Jr, et al. Toward resolution of cardiovascular health disparities in African Americans: design and methods of the Jackson Heart Study. Ethn Dis. 2005;15:S6–17. [PubMed]
25. Musunuru K, et al. Candidate gene association resource (CARe): design, methods, and proof of concept. Circ Cardiovasc Genet. 2010;3:267–275. [PMC free article] [PubMed]
26. The International HapMap Consortium A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
27. Martin J, et al. The sequence and analysis of duplication-rich human chromosome 16. Nature. 2004;432:988–994. [PubMed]
28. Doggett NA, et al. A 360-kb interchromosomal duplication of the human HYDIN locus. Genomics. 2006;88:762–771. [PubMed]
29. Kim JI, Ju YS, Kim S, Hong D, Seo JS. Detection of hydin Gene Duplication in Personal Genome Sequence Data. Genomics & Informatics. 2009;7:159–162.
30. Reiner AP, et al. Genome-Wide Association Study of White Blood Cell Count in 16,388 African Americans: the Continental Origins and Genetic Epidemiology Network (COGENT). PLoS Genetics. 2011;7:e1002108. [PMC free article] [PubMed]
31. Guipponi M, et al. Genomic structure of a copy of the human TPTE gene which encompasses 87 kb on the short arm of chromosome 21. Human genetics. 2000;107:127–131. [PubMed]
32. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. [PubMed]
33. Bailey JA, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–1007. [PubMed]
34. Bailey JA, et al. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. The American Journal of Human Genetics. 2002;70:83–100. [PubMed]
35. Golfier G, et al. The 200-kb segmental duplication on human chromosome 21 originates from a pericentromeric dissemination involving human chromosomes 2, 18 and 13. Gene. 2003;312:51–59. [PubMed]
36. Ruault M, Ventura M, Galtier N, Brun ME, Archidiacono N. BAGE genes generated by juxtacentromeric reshuffling in the Hominidae lineage are under selective pressure. Genomics. 2003;81:391–399. [PubMed]
37. Dennis MY, et al. Evolution of Human-Specific Neural SRGAP2 Genes by Incomplete Segmental Duplication. Cell. 2012;149:912–922. [PMC free article] [PubMed]
38. Sudmant PH, et al. Diversity of human copy number variation and multicopy genes. Science. 2010;330:641–646. [PMC free article] [PubMed]
39. The BAC Resource Consortium Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001;409:953–958. [PubMed]
40. Mahtani MM, Willard HF. Physical and genetic mapping of the human X chromosome centromere: repression of recombination. Genome research. 1998;8:100–110. [PubMed]
41. Samonte RV, Eichler EE. Segmental duplications and the evolution of the primate genome. Nature Reviews Genetics. 2002;3:65–72. [PubMed]
42. She X, et al. The structure and evolution of centromeric transition regions within the human genome. Nature. 2004;430:857–864. [PubMed]
43. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenetic and genome research. 2006;115:205–214. [PubMed]
44. Ryan DP, et al. Mutations in potassium channel Kir2. 6 cause susceptibility to thyrotoxic hypokalemic periodic paralysis. Cell. 2010;140:88–98. [PMC free article] [PubMed]
45. Eichler EE. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends in Genetics. 2001;17:661–669. [PubMed]
46. Gnerre S, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 2011;108:1513–1518. [PubMed]
47. Christiansen J, et al. Chromosome 1q21. 1 contiguous gene deletion is associated with congenital heart disease. Circulation research. 2004;94:1429–1435. [PubMed]
48. Stone JL, et al. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature. 2008;455:237–241. [PubMed]
49. Stefansson H, et al. Large recurrent microdeletions associated with schizophrenia. Nature. 2008;455:232–236. [PMC free article] [PubMed]
50. Mefford HC, et al. Recurrent rearrangements of chromosome 1q21. 1 and variable pediatric phenotypes. New England Journal of Medicine. 2008;359:1685–1699. [PMC free article] [PubMed]
51. Brunetti-Pierri N, et al. Recurrent reciprocal 1q21.1 deletions and duplications associated with microcephaly or macrocephaly and developmental and behavioral abnormalities. Nat. Genet. 2008;40:1466–1471. [PMC free article] [PubMed]
52. Lango Allen H, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–838. [PMC free article] [PubMed]
53. Sırmacı A, et al. A Truncating Mutation in SERPINB6 Is Associated with Autosomal-Recessive Nonsyndromic Sensorineural Hearing Loss. The American Journal of Human Genetics. 2010;86:797–804. [PubMed]
54. Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nature methods. 2010;8:61–65. [PMC free article] [PubMed]
55. Ju YS, et al. Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals. Nat. Genet. 2011;43:745–752. [PubMed]
56. Church DM, et al. Modernizing reference genome assemblies. PLoS Biol. 2011;9:e1001091. [PMC free article] [PubMed]
57. Li H, Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010;26:589–595. [PMC free article] [PubMed]
58. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 2004.
59. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. [PMC free article] [PubMed]
60. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011;43:491–498. [PMC free article] [PubMed]
61. Korn JM, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature genetics. 2008;40:1253–1260. [PMC free article] [PubMed]
62. Price AL, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genetics. 2009;5:e1000519. [PMC free article] [PubMed]
63. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. [PMC free article] [PubMed]
64. Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet. 2011;43:269–276. [PubMed]
65. Robinson JT, et al. Integrative genomics viewer. Nature Biotechnology. 2011;29:24–26. [PMC free article] [PubMed]