|Home | About | Journals | Submit | Contact Us | Français|
The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 novel insertion sequences corresponding to 720 genomic loci. We show that a substantial fraction of these sequences are either missing, fragmented or mis-assigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determine that 18–37% of these novel insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identifies novel exons and conserved non-coding sequences not yet represented in the reference genome. We develop a method to accurately genotype these novel insertions by mapping next-generation sequencing datasets to the breakpoint thereby providing a means to characterize copy-number status for regions previously inaccessible to SNP microarrays.
The human genome reference assembly is a mosaic of distinct haplotypes sampled from multiple individuals1. As a result of both gaps in the assembled sequence and the structural differences that exist among different humans, individual genome projects are expected to uncover human sequences present in some (or all) individuals that are not represented in the assembly. Consistent with this prediction, the first sequences of individual genomes2, 3 revealed 23–29 Mb of sequence that do not map against the reference assembly. The short-read, high-throughput approaches currently being employed are also expected to uncover unrepresented insertions4–7. However, these sequences often assemble only as short (median length of 220 to 314 bp 7) contiguous sequences (contigs) that are difficult to anchor and incorporate into existing genome assemblies. Thus, while thousands of novel sequences may be discovered over the next few years, their annotation and complete integration into the human genome will remain a significant bottleneck 8. Since genotyping and expression microarrays are fundamentally dependent upon the reference genome for array probe design, a small fraction of the human genome effectively can not be assayed.
We recently reported efforts to systematically map and sequence human genome structural variation using a fosmid end-sequence pair mapping approach9–11. We fragmented genomic DNA from nine human individuals and subcloned 40-kb segments. Using standard capillary sequencing, reads were generated from both ends of each fragment (end-sequence pairs) and clones were mapped to the human reference genome. Structural differences (inversions, deletions, insertions and translocations) between the reference genome assembly and library source were identified based on the mapped location of the end-sequence pairs. Since the individual fosmid clones were retained, the procedure allowed simultaneous discovery and complete sequence characterization of a subset of structural variant loci including novel insertion sequences common to most individuals but not represented in the human reference genome. Here, we present a detailed sequence and copy-number analysis of these segments missing from the human reference genome.
We systematically searched 9.7 million end-sequence pairs, corresponding to 92-fold physical coverage of the human genome, for sequences that failed to map to the reference sequence (NCBI build35). The end-sequence dataset was derived from nine individual genomes (4 Yoruba individuals from Ibadan Nigeria (YRI), 2 individuals with European ancestry (CEU), 2 individuals with Han Chinese or Japanese ancestry (CHB+JPT), and 1 individual of unknown ethnicity). We distinguished clones that only mapped onto the assembly with one end (one-end anchored or OEA, clones) and orphan clones where neither end mapped. After eliminating low-quality sequence and obvious viral and bacterial contaminants, we identified 44,415 high-quality fosmid end sequences that do not map onto the genome reference sequence (NCBI build35)11. This set includes individual sequences from 26,001 OEA clones and 9,207 orphan clones. Using phrap (http://www.phrap.org), we initially assembled these individual sequences into 3,963 sequence contigs (total size = 4.47 Mb, N50 = 1,148 bp) (Table 1) but after applying additional experimental and computational filters, this was reduced to 2,363 distinct sequence contigs (Supplementary Note).
40% (1,019/2,363) of the contigs contain sequence contributed by at least one orphan clone, suggesting that these contigs represent segments longer than 40 kb (Supplementary Table 1). Using OEA anchoring information and the mate-pair relationships from the orphan clones, we identified 720 contigs (400 of which have a mapped genomic position) corresponding to ~2.8 Mbp of sequence with a median contig size of 1 kb (Supplementary Note). Interestingly, 80 of the 400 anchored loci (20%) map within 5 Mb of the ends of a chromosome (a significant 2.9-fold subtelomeric enrichment, p=1.0e-18, binomial test) (Supplementary Fig. 1, Supplementary Table 2). In addition to these 720 loci, we identified 19,038 singleton OEA sequences (average length 790 bp) as well as 5,654 orphan clones that did not contribute to any contigs. By convention, we refer to these sequences as “novel insertions” based on the fact that they are not present within the public reference genome assembly.
Our analysis distinguished two different types of novel human sequences: 400 loci that were anchored within euchromatin based on OEA assignments and 320 unassigned loci where a clear anchor position could not be identified. We explored the genomic distribution and assessed the accuracy of our assigned locations using individual fosmid clones as FISH probes. Although limited to larger regions, this analysis provided us valuable high-level mapping information with respect to the distribution of insertions in heterochromatin and euchromatin. We selected 33 contigs derived only from orphan clones (assigned to seven distinct unmapped loci) and mapped these loci to metaphase chromosomes by FISH. Three loci mapped separately to telomeric regions on chromosomes 10q, 7p, and Xp; one locus mapped to 6q1; and three loci mapped to the p-arms of the acrocentric chromosomes (Supplementary Table 1).
As a complement to these studies we also tested an additional 68 large orphan contigs, which were constructed based on a detailed fingerprint analysis of all orphan clones from a single individual human genome library (NA15510) (Supplementary Note). After excluding 31 contigs assigned to genome assembly gaps12, we found that 15 of the contigs mapped interstitially, with the remainder (22/37) mapping to telomeric, pericentromeric or acrocentric positions (Supplementary Table 3, Supplementary Fig. 2).
Finally, we considered sequence contigs that had been anchored by OEA clones, but also had contributions from at least one orphan clone, to positions in human euchromatin by using 37 fosmid clones (20 OEA and 17 orphan clones) as FISH probes. We found that 78% (29/37) of the clones support the predicted position while 11% (4/37) map to a different interstitial location and 11% (4/37) map to the p-arms of the acrocentric chromosomes. We additionally tested a limited number (n = 3) of smaller insertions (<30 kb) that had been completely sequenced and confirmed all three by metaphase oligo-FISH, finding that 2/3 were copy-number polymorphic among the four individuals tested (Supplementary Note). Our FISH results indicate that mega-bases of uncharacterized sequence remain within the heterochromatin and euchromatin-heterochromatin transition regions of the human genome but also confirm the presence of missing euchromatic sequences that are copy-number polymorphic.
We searched for evidence of the identified 2,363 sequence contigs in other human and non-human primate genome assemblies. 600 contigs (71 loci)have a match against the newest human reference genome assembly, GRCh37 and 1,467 contigs (54 loci) have a match against the HuRef assembly 2 (Supplementary Note). We find partial support for 1,700–2,000 of the contigs in sequence data from the JDW, YH, and NA18507 genomes3–5 (Supplementary Note). One of the genomes in our study, NA18507, was sequenced to high coverage using the Illumina platform4 and subjected to a SOAP de novo sequence assembly 7. Surprisingly, we found that the 94% of our smallest insertions identified from single unmapped reads (~790 bp) were not identified as part of the de novo assembly (Supplementary Note). 32% of our larger contigs had no representation and only 25% had complete sequence coverage (defined as more than 95% bp representation). When we restricted our analysis to insertions from sequenced NA18507 clones, we found that 52% (11/21 sequenced fragments) were either not present (n = 4) or mapped to different scaffolds (n = 7) in the de novo assembly. We find that this fragmentation often corresponds to the presence of large common repeat sequences that disrupt the contiguity and complicate map assignment. Regions largely devoid of common repeats or segmental duplications showed the greatest correspondence in length and coverage.
In order to determine the ancestral state of each of these sequences, we also searched the 2,363 contigs against available whole-genome sequence data from chimpanzee and orangutan13. 74% (1,745/2,363) of the contigs had a match against one of these datasets with 68% (1,599/2,363) of the contigs identified within chimpanzee. We were concerned that these sequences may have characteristics leading to their underrepresentation in genome-sequencing datasets, so we performed an arrayCGH experiment using DNA from a single chimpanzee and tested whether the DNA in fact hybridized. This experiment indicated that 84% (1,985/2,363) of the contigs were present in the single chimpanzee analyzed (Supplementary Note). This includes 624 contigs that do not have a match to the chimpanzee genome sequence data. In total, we find experimental or computational support for 94% (2,223/2,363) of the contigs in the chimpanzee and 96% (2,266/2,363) in either chimpanzee or orangutan. The absence of these new insertions in the current reference genome represents either genome assembly errors or deletions that have emerged within the human lineage and are now copy-number polymorphic in our species.
We designed two customized oligonucleotide microarrays in order to provide an assessment of copy-number polymorphism among these novel insertion sequences. In the first, we designed a microarray targeting the 19,038 single OEA sequences that did not assemble into sequence contigs and tested them against the sample genomes used for discovery. After filtering additional contaminants, we found that 38% (7,240/19,038) of these unassembled sequences were represented by at least three probes with signal intensities sufficiently above the background level. Based on a comparison of the intensity values for the eight analyzed samples, we estimate that 31% (2,228/7,240) of the assayable single OEA sequences are copy-number polymorphic (Supplementary Table 4).
In the second design, we investigated copy-number polymorphism for the 2,363 sequence contigs that had been assigned to 720 distinct loci and tested a larger collection of 28 unrelated HapMap individuals (9 CEU, 11 YRI, 8 JPT+CHB). These experiments clearly identified sets of sequences that are copy-number polymorphic or apparently fixed among the analyzed individuals (Fig. 1). Polymorphic contigs were identified using two alternative calling schemes: a noise-multiplier approach that compares the median probe log-ratios for each contig with the results of a control self-self hybridization (using reference sample NA15510, Supplementary Table 5) and a clustering approach that assigns contigs to log-ratio clusters14 that are then fitted to distinct, small integer copynumber states (Fig. 2, Supplementary Table 6). The noise-multiplier approach identifies 37% of the contigs as being copy-number polymorphic. 518 contigs could be fitted to a copy-number state, of which 461 contigs are fitted to two or more distinct copy-number states. 443 contigs (18.7%) were identified as polymorphic by both approaches, an indication of the challenges in assigning discrete copy numbers to all copy-number variable loci.
We assessed the extent of population differentiation for these sequences using both the FST and VST statistics15, 16. For 189 loci with a simple autosomal insertion-deletion variant, we found 20 loci having an FST greater than 0.35 (Fig. 3, Supplementary Table 7, Supplementary Table 8, and Supplementary Fig. 3). Among these, we identified a 3.9-kb insertion sequence within the first intron of the lactase gene (LCT)(Fig. 2). Interestingly, this 3.9-kb insertion is prevalent among the YRI samples tested (allele frequency=0.86) but is largely absent among the CEPH Europeans (allele frequency = 0.11) where it is in complete linkage disequilibrium (D’ = 1) with the functional SNP that has been associated with lactase persistence17. We repeated the analysis using the VST statistic, for all 720 loci (Supplementary Fig. 4). We identified 27 loci that have a VST value greater than 0.35 with ten having a value greater than 0.5 (Supplementary Table 9). Fosmid clones corresponding to several of the most stratified loci have been completely sequenced, including a 4.8-kb insertion on chr20 (AC205876, Fig. 2, VST = 0.73, FST = 0.70) and an 11.4-kb insertion on chr1 near the ATP6V1G3 gene (AC212752, VST = 0.48, FST = 0.37). These sites represent structures that show a high level of differentiation among human populations but are absent from the genome reference.
The complete sequence of insertions smaller than 40 kb can be directly obtained by sequencing an appropriate fosmid clone, while an iterative strategy is required to capture the sequence of larger insertions. 222 fosmid clones (53 OEA clones and 169 spanned insertions) were sequenced using a traditional capillary sequencing and assembly approach (Supplementary Table 10, Supplementary Fig. 5). The 222 clones correspond to 192 distinct genomic loci and contain a total of 1.67 Mb of inserted sequence (Supplementary Fig. 6) subsuming 475 of our original 2,363 contigs. Four of the completely sequenced insertions, ranging in size from 41–65 kb, were larger than a single clone insert (Supplementary Note). The sequenced insertions are similar in composition to segments sampled from the reference genome assembly, with a slight enrichment for common repeats, particularly LINEs (Supplementary Table 11). Only five of the 192 loci (Supplementary Table 12) have been updated in GRCh37, thus the majority (97%) of these insertions await integration into the next version of the human genome.
We searched the sequenced insertions against the RefSeq gene database18 to identify previously uncharacterized exons. We found that segments from 22 genes matched 21 of the insertions (Supplementary Table 13) including support for structures not represented in the build36 assembly (eg. MINK1, FSCN2, PECAM1, and VPRBP genes (Figure 4). We further searched for expressed elements using mRNA-seq data derived from multiple human tissues that do not map onto the build36 genome assembly19. We mapped these previously unmapped reads onto the sequenced clones and found that 26 insertions contained segments supported by at least three mRNA-seq reads (Supplementary Fig. 5, Supplementary Note). We searched against an alignment of nine mammalian genomes to identify segments matching the sequenced insertions (Ensembl Compara 51) 20, 21. Using these alignments, we identified 477 constrained elements from 104 different loci (Fig. 4), a signature that identifies segments of possible functional importance22. Six of the constrained elements intersect with mapped RefSeq exons with the remainder having an unknown functional importance. Using Genomic Evolutionary Rate Profiling (GERP) scores as a metric, we note that the conserved elements found in the insertions show a similar level of constraint compared to elements identified across the rest of the alignments (Supplementary Fig. 7).
High-quality sequence across the variant breakpoints permitted a detailed assessment of exact variant boundaries and associated sequences. We used the breakpoint sequence data obtained from 152 insertions spanned by individual fosmid clones to identify a set of unique, diagnostic k-mers specific to the insertion and deletion alleles of each variant (Fig. 5). We found that 108 of the sequenced loci could be uniquely identified using a k-mer length of 36 and a search stringency of one substitution. 29% of the loci (44/152) could not be uniquely identified using this approach, although we note that this method assumes that the genome reference assembly accurately represents the structure of the deletion allele and all instances of the variant have identical breakpoints. If k-mer lengths increased to 100 bp, there would still be five loci that remained recalcitrant to analysis using this approach (Fig. 5b). We determined genotypes for 106 loci by searching Illumina sequence data from NA18507 against these diagnostic k-mers4. We observed agreement at 94.3% of the genotypes determined for this individual by arrayCGH (Fig. 5c, Supplementary Table 14). We simulated the effect of genome coverage by sampling subsets of the total sequence data from NA18507 (Fig. 5d). We found a rapid increase in the number and accuracy of the sites genotyped with increasing coverage, followed by a plateau of approximately 94% genotype agreement when sequencing coverage reached 10-fold sequence coverage. This indicates that high-quality breakpoint sequence data can be used to genotype structural variants in samples that have been analyzed by next-generation sequencing.
Over the past five years the extent of structural variation among individual human genomes has become increasingly clear. Array-based approaches, for example, have systematically discovered and genotyped more than 50% of common copy-number polymorphic deletions23, 24. Sequence-based approaches have begun to more fully explore the size spectrum, cataloging an increasing number of smaller deletions and moving toward personalized duplication maps for individual genomes9, 11, 25 , 26. The characterization of other classes of structural variation, including inversions and insertions, however, has lagged due to technical biases in their discovery and difficulties associated with their validation. New insertions are limited, in particular, by the genetic community’s reliance on a single mosaic reference genome, which at some positions represents rare structural configurations and entirely omits sequences that are found in the majority of individuals. The absence of these sequences from the reference genome hinders their functional characterization leading to a less-than-complete understanding of the sequence content present in the majority of humans. We used a fosmid clone strategy to specifically focus on the characterization of human sequences that are not in the reference assembly and have therefore not been annotated for functional elements or systematically genotyped.
In this study we identified 720 distinct loci ranging in size from 1–20 kbp in length as well as several thousand additional smaller segments <1 kbp in length. We have determined that more than half map to the euchromatin with a disproportionate fraction mapping within the last 5 Mbp of human chromosomes (Supplementary Fig. 1). A remarkable feature of these sequences is their degree of copy-number polymorphism. ArrayCGH analysis indicates that 18–37% of the assembled sequence contigs vary in copy number, with 80% of the genotyped variants having a minor allele frequency >10% among the 28 individuals surveyed (Fig. 3). Experimental and computational comparisons with chimpanzee DNA suggest that at least 94% arose as a result of deletions that occurred within the human lineage.
Many of the common insertions show striking differences in allele frequency among populations, a pattern suggestive of either selection or genetic drift since the migration of humans out of Africa (Fig. 2, Supplementary Table 8, Supplementary Table 9). We observe that the average insertion allele frequency for the variable loci was significantly greater in African populations when compared to European or Asians (YRI versus CEU p = 0.0003 and YRI versus ASN p = 0.005, 1 sided t-test). The 3.9-kb novel insertion within the first intron of the LCT gene is illustrative. Our initial survey suggests that this insertion sequence is prevalent among the Yoruba (86%) and Asian samples (63%) but is present at a much lower frequency among CEPH Europeans (11%). These findings raise the possibility that the additional sequence within this haplotype may play a role in regulating expression of this gene. The complete sequence of this insertion sequence (AC20193) now allows this hypothesis to be directly tested.
An important question going forward is how well de novo assembly methods using next-generation sequence data compare to the clone-based approach we have described here. We had the opportunity to compare an Illumina SOAP de novo assembly 7 against the clone-based discovery on the same individual genome (Supplementary Note). We found that many of the larger novel contigs were only partially represented (50–60%) in a 30X de novo assembly, and in more than a third of studied cases novel contigs were fragmented—mapping to two or more scaffolds instead of being placed in the same region. In many cases, the fragmentation corresponded to common repeats disrupting the contiguity of the novel sequence. In regions largely devoid of retrotransposons, de novo sequence assemblies using NGS datasets perform quite well. These results highlight both the limitation of de novo sequence assembly using NGS and the value of high-quality clone-based data to resolve and integrate these sequences into the reference genome. Nevertheless, there are advantages to de novo assembly. The de novo sequence assembly identifies 2–3 times more novel sequence per genome when compared to our results from 0.3X sequence coverage per genome, suggesting that the methods are complementary. Surprisingly, only 2.9% of our singletons from NA18507 (average size ~790 bp) were identified in the de novo assembly. Since these smaller insertions require more characterization, the significance of this discrepancy is unclear.
The major benefit of our approach is the ability to directly obtain high-quality sequence for the insertion loci by complete sequencing of corresponding clone inserts at a quality commensurate with that of the human reference genome. While no complete missing genes were discovered, we did identify 477 elements that have been conserved over evolutionary time, six of which appear to correspond to exons from RefSeq genes as well as 26 loci having support from multiple mRNA-seq reads. Moreover, we demonstrate that these high-quality sequences can be utilized to accurately genotype these regions using next-generation sequence sets produced from the 1000 Genomes and other projects. The complete sequence of these and other loci will facilitate their functional characterization as they can now be incorporated into future genotyping platforms, expression microarrays, and ultimately future genome assemblies to provide a more accurate representation of the organization and genetic variation of the human genome.
We thank C. Campbell, G. Cooper, T. Marques-Bonet for thoughtful discussion, P. Sudmant for assistance with Illumina sequence data, and members of the University of Washington and Washington University Genomes Centers for assistance with data generation. J.M.K. is supported by a National Science Foundation Graduate Research Fellowship. This work was supported by grant HG004120 to E.E.E. E.E.E. is an Investigator of the Howard Hughes Medical Institute.
Gene Expression Omnibus: GSE20634
GenBank accessions for assembled contigs: GU266782 – GU269144
The accessions for fully sequenced fosmid inserts are given in Supplementary Table 10 and are also listed below
Competing financial interests
N.S. P.A. A.T. N.Y. P.T. and L.B. are employees of Agilent Technologies. E.E.E. is a SAB member for Pacific Biosciences.
Author contributionsJ.M.K, N.S., F.A., A.T., R.K. and E.E.E. analyzed data, N.S., P.A, A.T., N.A.Y, P.T. and L.B. performed arrayCGH and copy-number analysis. F.A, M.V. and G.G. performed FISH experiments. C.A. assembled contigs, T.G., R.F., H.S.H, M.M, J.K., R.K, and R.K.W. performed clone characterization and sequencing. J.M.K, R.K., L.B. and E.E.E designed the study. J.M.K and E.E.E. wrote the paper with contributions from the other authors.