We describe a clone resource from 17 human DNA samples that provides 135-fold physical coverage of the human genome. The corresponding catalogue and clones can be used to further characterize almost any segment of human euchromatin. We used this resource to assess breakpoint characteristics of 1,054 events. The nature of our experimental design permitted us to discover more events mediated by larger segments of homology, providing a more complete assessment of human genetic variation. Of particular interest are complex events whose sequence features have previously been difficult to assess at a genome-wide level. The high quality and length of the sequenced fosmids, combined with defined paralogous sequence events, allowed us to quantify alternating sequence matches suggestive of interlocus gene conversion (
Bayes et al., 2003;
Lagerstedt et al., 1997;
Reiter et al., 1997;
Visser et al., 2005).
Using this resource, we obtained the complete structure of several alleles that have been associated with disease, including a deletion variant upstream of the
NEGR1 gene associated with increased body mass index (
Willer et al., 2009) (clone AC210916), two deletion polymorphisms upstream of the
IRGM gene associated with Crohn’s disease (
Barrett et al., 2008;
Bekpen et al., 2009;
McCarroll et al., 2008a) (clone AC207974), and the deletion of the
LCE3B and
LCE3C genes. In total, we conservatively estimate that 1.04% (11/1,054) of the discovered variants are associated with disease. This yield of disease-associated alleles rivals that found by genome-wide association studies using SNPs, which have identified 779 genome-wide associations based on genotyping of at least 100,000 SNPs (
http://www.genome.gov/multimedia/illustrations/GWAS2010-3.pdf). Although the functional significance of many of the other structural variants remains to be determined, the clone resource and the availability of the complete sequence of variant haplotypes will facilitate future studies through the rapid design of assays to test for association with disease (
Abe et al., 2009;
An et al., 2009;
Kidd et al., 2007) or direct comparison with short sequencing reads from next-generation sequence platforms (
Kidd et al., 2010;
Lam et al., 2010).
We investigated this approach for 1,024 non-VNTR sequenced structural variants (
Supplementary Table 7) and found that 71% (726/1,024) of the variants are uniquely identifiable with a read length of 36 bp and a uniqueness threshold permitting up to one substitution. These include 32 inversions, balanced events that are invisible to array-based genotyping approaches. As read lengths increase to 100 bp, we estimate that 88% (902/1,024) of these variants could be genotyped. The construction of complete alternative haplotypes then facilitates the use of read-pair information to distinguish among distinct structural configurations (
Antonacci et al., 2010).
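The uniqueness criterion described above can be illustrated with a minimal sketch. It is not the procedure used in this study, whose implementation is not described here. It assumes a toy reference held in memory, a brute-force scan, and a breakpoint given as the index of the first base after the junction on the variant haplotype. The function names (hamming_within, occurs_in_reference, diagnostic_kmers) are hypothetical. A breakpoint-spanning read is treated as diagnostic if it matches no reference window of the same length within the permitted number of substitutions.

```python
def hamming_within(a: str, b: str, max_mismatches: int) -> bool:
    """True if equal-length strings differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def occurs_in_reference(kmer: str, reference: str, max_mismatches: int = 1) -> bool:
    """Brute-force scan: does any reference window match kmer within the threshold?"""
    k = len(kmer)
    return any(
        hamming_within(kmer, reference[i:i + k], max_mismatches)
        for i in range(len(reference) - k + 1)
    )


def diagnostic_kmers(variant_haplotype: str, breakpoint: int, reference: str,
                     read_length: int = 36, max_mismatches: int = 1) -> list:
    """Return breakpoint-spanning reads of the variant haplotype that are absent
    from the reference, allowing up to max_mismatches substitutions.

    Assumption: breakpoint is the index of the first base after the junction,
    so a window spans the junction if it covers positions breakpoint - 1 and
    breakpoint.
    """
    first = max(0, breakpoint - read_length + 1)
    last = min(breakpoint - 1, len(variant_haplotype) - read_length)
    hits = []
    for start in range(first, last + 1):
        kmer = variant_haplotype[start:start + read_length]
        if not occurs_in_reference(kmer, reference, max_mismatches):
            hits.append(kmer)
    return hits


# Toy deletion: the reference carries a segment that the variant haplotype lacks.
toy_reference = "A" * 10 + "C" * 10 + "G" * 10
toy_variant = "A" * 10 + "G" * 10  # junction between index 9 and index 10
toy_hits = diagnostic_kmers(toy_variant, 10, toy_reference, read_length=8)
```

Under this sketch, a variant would count as uniquely identifiable if at least one diagnostic read exists. Permitting a substitution makes the criterion stricter than exact matching, because reads that lie within one mismatch of any reference window are excluded. A genome-scale analysis would need an indexed search rather than a linear scan.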
Although short-read technologies may miss some of the breakpoint sequences, there are many advantages to applying them to genome structural variation. These include the detection of thousands more events per individual genome, especially variants below the detection threshold of the fosmid ESP approach. The dynamic range response and the sequence specificity of next-generation sequencing allow absolute copy number and the identity of duplicated genes to be accurately predicted. One of the strengths of this clone resource, however, is that it permits the iterative assessment of predicted variants. Clones corresponding to structural variants discovered by other methods applied to these 17 individuals, including newly developed approaches such as methods for identifying transposon insertions (
Huang et al., 2010;
Witherspoon et al., 2010), may be retrieved, providing complete sequence information for additional events and thereby a resource set of sequenced variant haplotypes. The availability of the underlying clones and the potential location of the variant within a specific DNA sample provide an approach for more fully exploring the genetic architecture and mutational properties of these regions. Thus, we predict that such a resource will be a valuable complement for understanding the true complexity of human genetic variation as human genomes become routinely sequenced using short-read technology.