Over the past five years the extent of structural variation among individual human genomes has become increasingly clear. Array-based approaches, for example, have systematically discovered and genotyped more than 50% of common copy-number polymorphic deletions [23, 24]. Sequence-based approaches have begun to more fully explore the size spectrum, cataloging an increasing number of smaller deletions and moving toward personalized duplication maps for individual genomes [9, 11, 25, 26]. The characterization of other classes of structural variation, including inversions and insertions, however, has lagged due to technical biases in their discovery and difficulties associated with their validation. The discovery of new insertions is limited, in particular, by the genetics community’s reliance on a single mosaic reference genome, which at some positions represents rare structural configurations and entirely omits sequences that are found in the majority of individuals. The absence of these sequences from the reference genome hinders their functional characterization, leading to a less-than-complete understanding of the sequence content present in the majority of humans. We used a fosmid clone strategy to focus specifically on the characterization of human sequences that are not in the reference assembly and have therefore not been annotated for functional elements or systematically genotyped.
In this study we identified 720 distinct loci ranging from 1 to 20 kbp in length as well as several thousand additional smaller segments <1 kbp in length. We have determined that more than half map to the euchromatin, with a disproportionate fraction mapping within the last 5 Mbp of human chromosomes (Supplementary Fig. 1). A remarkable feature of these sequences is their degree of copy-number polymorphism. ArrayCGH analysis indicates that 18–37% of the assembled sequence contigs vary in copy number, with 80% of the genotyped variants having a minor allele frequency >10% among the 28 individuals surveyed. Experimental and computational comparisons with chimpanzee DNA suggest that at least 94% arose as a result of deletions that occurred within the human lineage.
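To make the minor allele frequency threshold concrete, the following is a minimal sketch of how such a frequency could be computed for a single biallelic insertion locus. It assumes that each individual has already been assigned a diploid genotype coded as the number of insertion-allele copies (0, 1 or 2); the function names and the genotype values are hypothetical illustrations and are not taken from the study's arrayCGH pipeline or data.

```python
def insertion_allele_frequency(genotypes):
    """Frequency of the insertion allele at one locus.

    Assumes diploid, biallelic genotypes coded as the number of
    insertion-allele copies carried by each individual (0, 1 or 2).
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be 0, 1 or 2")
    # Each individual contributes two chromosomes to the denominator.
    return sum(genotypes) / (2 * len(genotypes))


def minor_allele_frequency(genotypes):
    """Frequency of whichever allele (insertion or non-insertion) is rarer."""
    freq = insertion_allele_frequency(genotypes)
    return min(freq, 1.0 - freq)


# Hypothetical genotypes for 28 individuals (toy values, not study data).
toy_genotypes = [0] * 10 + [1] * 12 + [2] * 6
maf = minor_allele_frequency(toy_genotypes)
is_common = maf > 0.10  # the >10% threshold mentioned in the text
print(f"MAF = {maf:.3f}, common variant: {is_common}")
```

Under these assumptions a locus counts toward the common class whenever the rarer of its two alleles is carried on more than 10% of the sampled chromosomes.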
Many of the common insertions show striking differences in allele frequency among populations, a pattern suggestive of either selection or genetic drift since the migration of humans out of Africa (Supplementary Table 8, Supplementary Table 9). We observed that the average insertion allele frequency for the variable loci was significantly greater in African populations than in European or Asian populations (YRI versus CEU p = 0.0003 and YRI versus ASN p = 0.005, one-sided t-test). The 3.9-kbp novel insertion within the first intron of the LCT gene is illustrative. Our initial survey suggests that this insertion sequence is prevalent among the Yoruba (86%) and Asian samples (63%) but is present at a much lower frequency among CEPH Europeans (11%). These findings raise the possibility that the additional sequence within this haplotype may play a role in regulating expression of this gene. The complete sequence of this insertion (AC20193) now allows this hypothesis to be directly tested.
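The one-sided comparison of average insertion allele frequencies between populations can be illustrated with a small sketch. Note that this is not the t-test reported above: it substitutes an exact one-sided permutation test, which needs only the Python standard library and makes no distributional assumption. It also assumes that per-locus frequencies are treated as two unpaired samples. The function name and the frequency values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def one_sided_permutation_p(group_a, group_b):
    """Exact one-sided p-value for the hypothesis mean(group_a) > mean(group_b).

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and counts the splits whose mean difference is at least
    as large as the observed one. Only practical for small samples.
    """
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a, n_total = len(group_a), len(pooled)
    total = sum(pooled)
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(n_total), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / (n_total - n_a)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


# Hypothetical per-locus insertion allele frequencies (toy values, not study data).
pop_1 = [0.86, 0.70, 0.55, 0.62, 0.48]
pop_2 = [0.11, 0.35, 0.40, 0.30, 0.52]
print(f"one-sided p = {one_sided_permutation_p(pop_1, pop_2):.4f}")
```

Exhaustive enumeration keeps the result deterministic, but the number of splits grows combinatorially, so for hundreds of loci a random subsample of permutations or a parametric test such as the t-test used in the study would be the practical choice.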
An important question going forward is how well de novo assembly methods using next-generation sequence (NGS) data compare to the clone-based approach we have described here. We had the opportunity to compare an Illumina SOAP de novo assembly against the clone-based discovery on the same individual genome (Supplementary Note). We found that many of the larger novel contigs were only partially represented (50–60%) in a 30X de novo assembly, and in more than a third of studied cases novel contigs were fragmented, mapping to two or more scaffolds instead of being placed in the same region. In many cases, the fragmentation corresponded to common repeats disrupting the contiguity of the novel sequence. In regions largely devoid of retrotransposons, de novo sequence assemblies using NGS datasets perform quite well. These results highlight both the limitations of de novo sequence assembly using NGS and the value of high-quality clone-based data to resolve and integrate these sequences into the reference genome. Nevertheless, there are advantages to de novo assembly. The de novo sequence assembly identifies 2–3 times more novel sequence per genome when compared to our results from 0.3X sequence coverage per genome, suggesting that the methods are complementary. Surprisingly, only 2.9% of our singletons from NA18507 (average size ~790 bp) were identified in the de novo assembly. Since these smaller insertions require more characterization, the significance of this discrepancy is unclear.
The major benefit of our approach is the ability to directly obtain high-quality sequence for the insertion loci by complete sequencing of corresponding clone inserts at a quality commensurate with that of the human reference genome. While no complete missing genes were discovered, we did identify 477 elements that have been conserved over evolutionary time, six of which appear to correspond to exons from RefSeq genes, as well as 26 loci supported by multiple mRNA-seq reads. Moreover, we demonstrate that these high-quality sequences can be used to accurately genotype these regions using next-generation sequence datasets produced by the 1000 Genomes Project and other projects. The complete sequence of these and other loci will facilitate their functional characterization, as they can now be incorporated into future genotyping platforms, expression microarrays, and ultimately future genome assemblies to provide a more accurate representation of the organization and genetic variation of the human genome.