To capture 95% of the common variation (minor allele frequency >5%), the plan is to make fosmid clone libraries (~40 kb inserts) from the genomic DNA of 48 unrelated females already genotyped in the HapMap, and BAC clone libraries from 14 unrelated HapMap males, with the concomitant production of ~50 Gb of human sequence in the form of end-sequence pairs (see white paper at http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/StructuralVariationProject.pdf for sample size rationale). The large-insert (~150 kb) BAC clone libraries will provide a mechanism by which to obtain sequence information on structural variants18 that are too large to be encompassed in the fosmid inserts, such as those associated with segmental duplications39 and the highly repetitive palindromic sequences of the Y chromosome40,41. Individuals studied in the International HapMap Project are ideal for this research because they are being characterized for structural variation by other means16-18,37, may be used for genome-wide variation discovery with full data release, and have already been genotyped for 3.4 million SNPs, making it possible to correlate structural variation with what is currently known about the genetic architecture of the region in question. Hence, genome libraries will be constructed from representative individuals with European, Asian and African ancestry.
Each human genome library will be constructed to tenfold physical coverage per individual and inserts will be end-sequenced. This should capture >98% of each parental haplotype in clones, even after allowing for cloning biases, sequence failure and failure of the end sequence to map to the genome14. The most important parameter for detecting structural variation in this plan is the insert size variance in both the fosmid and BAC libraries. With standard deviations of 1.5 kb for fosmid libraries, for example, it is possible to detect several hundred sites of structural variation as small as 5 kb per individual. The wider insert size distribution of BAC clones will require putative structural variant clones to be validated by fingerprinting before complete insert sequencing. A further benefit of this initiative is that it is expected to yield ~15-fold greater coverage of human genomic sequence, providing ample substrate for the recovery of previously unknown rare SNPs and smaller insertion/deletion polymorphisms7,8.
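The coverage and resolution figures above can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a simple Poisson model of clone placement, a hypothetical usable fraction of clones (0.8) to stand in for cloning bias, sequence failure and unmapped ends, and a three-standard-deviation cutoff for calling an insert discordant. None of these modelling choices is stated in the plan itself.

```python
import math


def haplotype_capture_fraction(diploid_coverage, usable_fraction=1.0):
    """Fraction of one parental haplotype spanned by at least one clone.

    Assumption: clones fall on the genome as a Poisson process, and the
    physical coverage of a diploid individual is split evenly between the
    two parental haplotypes. `usable_fraction` is a hypothetical discount
    for clones lost to cloning bias, failed reads or unmappable ends.
    """
    per_haplotype = diploid_coverage / 2.0 * usable_fraction
    return 1.0 - math.exp(-per_haplotype)


def min_detectable_event_kb(insert_sd_kb, n_sd=3.0):
    """Smallest size change (kb) that moves an insert past the cutoff.

    Assumption: a clone is called discordant when its mapped span differs
    from the library mean by more than `n_sd` standard deviations.
    """
    return n_sd * insert_sd_kb


# Tenfold coverage with every clone usable, then with 80% usable.
print(round(haplotype_capture_fraction(10.0), 4))       # 0.9933
print(round(haplotype_capture_fraction(10.0, 0.8), 4))  # 0.9817

# A 1.5 kb standard deviation gives a cutoff just under 5 kb.
print(min_detectable_event_kb(1.5))                     # 4.5
```

Under these assumptions, tenfold coverage of a diploid genome leaves a margin above the 98% target even when a fifth of the clones are unusable, and a 1.5 kb standard deviation places the detection cutoff at 4.5 kb, in line with the ~5 kb figure quoted for fosmids.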
A key aspect of the plan is to sequence all genomic clones that are discordant with the reference sequence in terms of length or orientation. On the basis of preliminary studies, we expect to identify several thousand sites of structural variation. These will be sequenced to a high degree, allowing base-pair resolution of the structural variants16. This amount of sequencing is well within the capacity of the genome centres. It is important to note that although some variants will be the result of simple insertions or deletions, others will be embedded in complex regions of the genome, and will have many rearrangements with respect to the reference sequence14,42. Clones from the library resource may also be useful to various research groups for other reasons. They could be used to close gaps in the human genome sequence and for follow-up investigation of positive ‘hits’ in whole-genome or candidate-region association studies by providing rapid and fairly complete characterization of all SNPs and structural variation on one or more associated haplotypes. In addition, they could be used to compare the ability of platforms to accurately detect different types of variation.
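Selecting clones that are discordant "in terms of length or orientation" can be pictured as a simple classification of each end-sequence pair after mapping to the reference. The following is a minimal sketch, not the pipeline used by the sequencing centres: the `EndPair` record, the `classify_end_pair` function, the three-standard-deviation cutoff and the convention that a concordant pair maps as left end on `+` and right end on `-` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndPair:
    """Hypothetical record for one clone's two mapped end reads."""
    pos_left: int      # reference coordinate of the leftmost end read
    pos_right: int     # reference coordinate of the rightmost end read
    strand_left: str   # '+' or '-'
    strand_right: str  # '+' or '-'


def classify_end_pair(pair, mean_bp, sd_bp, n_sd=3.0):
    """Label a clone by comparing its mapped span with the library mean.

    Both ends are assumed to map to the same chromosome.
    """
    # Ends of an unrearranged clone face each other on the reference.
    if (pair.strand_left, pair.strand_right) != ("+", "-"):
        return "orientation_discordant"  # candidate inversion

    span = pair.pos_right - pair.pos_left
    if span > mean_bp + n_sd * sd_bp:
        # The ends sit further apart on the reference than the insert is
        # long, so the cloned genome lacks sequence the reference has.
        return "candidate_deletion"
    if span < mean_bp - n_sd * sd_bp:
        # The ends sit closer together than the insert is long, so the
        # cloned genome carries sequence the reference lacks.
        return "candidate_insertion"
    return "concordant"


# Fosmid-like library: 40 kb mean insert, 1.5 kb standard deviation.
print(classify_end_pair(EndPair(100_000, 140_500, "+", "-"), 40_000, 1_500))
print(classify_end_pair(EndPair(100_000, 150_000, "+", "-"), 40_000, 1_500))
print(classify_end_pair(EndPair(100_000, 130_000, "+", "-"), 40_000, 1_500))
print(classify_end_pair(EndPair(100_000, 140_000, "+", "+"), 40_000, 1_500))
```

In practice a site would presumably be supported by more than one discordant clone before its insert is fully sequenced; that filtering step is omitted here.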
Another goal of the initiative is to genotype the discovered variants in the full set of HapMap samples, thus contributing to an integrated map of SNPs and structural variants. This is especially important because of the many genome-wide association studies currently in progress or planned for the near future. Investigators interpreting these data will encounter the structural variants only through their SNP genotype data. Because no single technology can adequately genotype all forms of structural variation9,11,43, this effort, among others, should stimulate technological improvements that allow rapid, inexpensive and comprehensive assessment of all forms of structural variation. The immediate plan is to use the sequence-validated structural genetic variants from the 62 individuals (48 HapMap females and 14 HapMap males) to evaluate new technologies and to perform cross-platform comparisons of existing technologies, providing a better understanding of false positives and false negatives. The integration of the resulting structural genetic variant map with SNPs will offer clues to the evolutionary history of these variants in the genome. Structural variants that arose only once would be expected to show linkage disequilibrium with SNPs on their original haplotype, whereas structural variants that arise repeatedly would be expected to show little linkage disequilibrium with nearby SNPs. Identifying the structural genetic variants in linkage disequilibrium with nearby SNPs would also allow these variants to be tagged by SNPs, facilitating efficient identification of this subset of variants in subsequent association studies.
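The contrast between single-origin and recurrent variants can be made concrete with the standard r² measure of linkage disequilibrium. The sketch below assumes phased haplotypes coded as 0/1 at both the structural variant and the SNP; the function name `r_squared` and the toy haplotypes are hypothetical and are not data from the project.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic sites across phased haplotypes.

    Each list holds one 0/1 allele per haplotype, in the same order.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype lists must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0        # one of the sites is monomorphic
    return d * d / denom


# Toy data: eight phased haplotypes, 1 = variant allele present.
snp = [1, 1, 0, 0, 1, 0, 0, 0]
single_origin_sv = [1, 1, 0, 0, 1, 0, 0, 0]  # always on the SNP haplotype
recurrent_sv = [1, 0, 1, 0, 0, 1, 0, 0]      # arises on several backgrounds

print(r_squared(single_origin_sv, snp))
print(round(r_squared(recurrent_sv, snp), 3))
```

A structural variant with r² close to 1 for some nearby SNP could be tagged by that SNP in an association study, whereas a variant with low r² to every neighbouring SNP would need to be genotyped directly.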
All sequence data from this initiative, including the corresponding end-sequence pairs and assembled clone insert sequences, will be deposited in NIH-sponsored public databases, the Trace Archive and GenBank, respectively, according to standards already established for large-scale sequencing efforts (http://www.wellcome.ac.uk/doc_wtd003208.html). Incorporating information from larger and more complex rearrangements presents new challenges to the bioinformatics community. The NIH SNP database (dbSNP) is designed to accept several classes of smaller variant, including SNPs, microsatellite repeats and small insertion or deletion events, but not larger variants. It will be necessary, for example, to integrate alternative views of the human genome organization which are linked to the sequenced clones, provide sequence alignments of the structural genetic variants to the reference sequence, and flag regions in which mRNAs or genes could potentially be affected.
We propose the integration of sequence-defined structural genetic variation with the reference sequence and other genetic variation as part of dbSNP. The integrated information should include mapping data, size, structural properties, individual source and linkage disequilibrium with nearby variants, and could be treated as STS-like features (intervals defined by flanking sequence) when annotated against the reference genome assembly. As breakpoints are localized by sequencing and validation, the record can be expanded into sequenced haplotype alternatives. Similarly, public dissemination will benefit from integration with data on common genome browsers (such as that of the University of California, Santa Cruz, and ENSEMBL) as well as other public databases (for example, http://projects.tcag.ca/variation