Using both the NCBI build 36 and Celera assemblies we designed six primer pairs anchored in unique sequences that tiled the three gaps (for two of the three we used Celera contigs inside the build 36 gap to design primers; Table S1 in Additional data file 1). We amplified these regions via PCR from human genomic DNA (Coriell NA15510; see Materials and methods). End sequences of the PCR products matched the gap-flanking sequences, and product sizes on agarose gels closely matched the expected product sizes based on the gap sizing in the Celera assembly (Figure S1 in Additional data file 1). We then attempted to clone the PCR products directly, but sequencing of these showed that we were unsuccessful in obtaining clones that contained the desired product. Next, we produced and assembled small insert (average length 500 bp) 'shatter' libraries from the PCR products [
7]. This approach of breaking difficult regions into much smaller fragments has been used with great success to resolve sequences challenging to the finishing process. These assemblies did provide new sequence extending into each of the three gaps (1,002 of 2,658 bp, 2,229 of 10,166 bp and 1,618 of 5,554 bp), but failed to yield the full sequence spanning any of the gaps (Figure ).
As an alternative approach, we sheared and directly sequenced the gap-spanning PCR products using the 454 Life Sciences GS FLX [
8]. Reads were assembled using a module of the ARACHNE assembler designed for 454 data [
9] (see Materials and methods) since the 454 Newbler assembler does not assemble the sequences into a single contig. For each gap, the 454 reads were successfully assembled into a single, high-quality contig spanning the gap region (Figure ). In one case, a clear second haplotype containing a 108-base deletion was manually assembled from the remaining data. Existence of this second haplotype was also supported by PCR product sizes on agarose gels (Figure S2 in Additional data file 1). In all three cases, the assembly was concordant with the expected region size (Figure S1 in Additional data file 1) and in perfect agreement with all AB 3730 sequence reads previously obtained from PCR product ends and shatter libraries. In the three closed gaps, the sequences obtained solely by 454 were 1.2 to 1.6 kb in length and primarily of low complexity (G+T rich on one strand) but without a clear motif or repeating unit (see below).
A key difference between the 454 methodology and traditional sequencing is that the 454 process has no bacterial cloning step. We theorized that previous failure to close the gaps was due to bias against cloning the gap sequence in standard
Escherichia coli vector systems. To explore this, we designed PCR primers based on the complete 454 assembly and attempted to amplify and clone the regions that had been refractory to the initial clone-based approaches. These efforts yielded subclones; however, multiple lines of evidence demonstrated that all cloned products had deleted the majority of the low-sequence-complexity regions. Examination of capillary sequence traces from the subclones obtained from the PCR products showed that all sequenced subclones were missing the majority of the gap region and that deletions had clearly arisen at different locations in different subclones from each product. In all cases, alignment of the sequence showed the plasmid subclone insert to be 80 to 200 bases in length, while the actual lengths of the genomic sequences from which they were derived are all >1 kb. Finally, noise patterns in trace data from individual subclones suggested that deletion events had taken place during growth of the bacterial colony, resulting in a mixed population. Further, only two clones were found, both from the RPCI-11 Bacterial Artificial Chromosome (BAC) library, which spanned one gap and each had independently deleted the low complexity gap sequence along with a large surrounding region [
2]. We posit that such regions are heavily underrepresented in large insert clone libraries because non-deleting clones containing the gap sequences appear to be non-viable We conclude from these results that the gap sequences are unstable or toxic when propagated in bacterial vectors, resulting in strong bias to remove them.
As part of the Human Genome Project (HGP) work, comprehensive efforts were made to close all gaps by clone-based methods [
10]. All available end sequences from large insert clones were exhaustively analyzed in an attempt to identify clones spanning these gaps. In addition, all available large insert libraries (representing >50-fold physical coverage of the genome [
10]) were probed by hybridization with probes designed from sequences flanking all gaps. Finally, all available primate whole genome sequence assemblies were examined in attempts to find orthologous sequences that could be used to make probes that were then used in hybridization screens. Only with the exhaustion of all of these methods did the HGP gap closure efforts cease. Thus, the gaps remained either because spanning clones could not be unambiguously identified, or because the intervening sequences did not propagate well in the cloning vectors used. We note that despite intense finishing efforts by the sequencing centres to capture all human gene models, two of the three gaps closed in chromosome 15 were located in intronic regions of gene loci (GABRA5 and GABRG3), and that there were also gaps in orthologous regions of all available primate genome assemblies. However, these recalcitrant regions remained uncaptured. We further confirmed that the Celera and HuRef assemblies also fail to span these regions with clones (S Levy, personal communication), and note that efforts by Bovee
et al. [
3] to screen additional Fosmid ends were unsuccessful in closing these gaps.
If the 454 methodology surmounts the problems posed by these regions, we might expect them to be represented in a whole genome shotgun sequence done by 454. We aligned the assembled gap regions to the recently released 454 reads from the Watson genome project [
11] and found spotty coverage of two of the three gaps with reads landing within our finished sequence. This indicates that the 454 whole genome shotgun was able to represent these sequences, but failed to completely cover the gaps.
Analysis of the gap sequences that were recalcitrant to cloning showed they were largely composed of low complexity sequence, highly enriched in G+T on one strand, C+A on the other. To estimate how rare this type of sequence is, we slid a 4 kb window across the genome and measured G+T or C+A content. Using the empirical distribution obtained from these measurements, we computed the probability of picking a 4 kb region by chance with G+T or C+A content as high as any of our three gap regions at P = 0.0002; the likelihood of picking three such regions by chance is therefore extremely low (P = 8 × 10-12). We find, however, that regions (431 genome-wide) with even longer stretches of this type of sequence are present in the finished human genome sequence and captured in large insert clones. We conclude that sequence composition plays a significant role in what makes these regions difficult to clone, but there are likely to be other factors as well.
It has been shown that sequence rich in alternating pyrimidimes/purines tends to adopt Z-DNA conformations and that such DNA conformation tends to be difficult to clone [
12,
13]. In all three cases the sequence recalcitrant to cloning harbored large stretches (271, 485 and 364 bases long, respectively) of alternating pyrimidines/purines of the type that tends to form Z-DNA structure as postulated in Konopka
et al. [
12]. Our genome-wide analysis again revealed longer stretches within finished clones; however, since it is highly unlikely to find three such stretches by mere chance (
P = 10
-9), we conclude that the tendency of these regions to adopt Z-DNA conformation is likely, along with sequence composition, to also contribute to the intolerance of the gap sequences to cloning in
E. coli vectors.
Having closed three gaps on chromosome 15, we sought to determine how many other gaps in the genome might be closable by this method. In NCBI human build 36 there are 260 remaining gaps (excluding chromosome Y and the 29 gaps that contain heterochromatic regions, including centromeres and acrocentric short arms). We carried out a similar analysis to Eichler
et al. [
14], showing that the gaps fall into three classes as defined by sequence composition of the region. Type I gaps are subtelomeric: there are nine gaps in subtelomeric regions containing telomere-associated repeats. Type II contain duplicated euchromatin; this includes 30 pericentromeric gaps (within 1 megabase of a centromere) and 94 gaps flanked by segmental duplications. Type III gaps are in unique euchromatin; these 127 gaps do not show signatures of duplication or structural polymorphism. Based on our work here, we propose they remain gaps because they contain sequences recalcitrant to the standard bacterial cloning methods used for libraries in the HGP.
Class I and II gaps are likely to arise from unresolved structural complexity in the genome and can be attacked by the methodology of carefully reassembling existing tiling paths or by reassembling the area using a single haplotype, as described previously [
2,
3]. The three gaps closed here are representative of class III; the gaps that appear to remain because they contain sequences that are grossly underrepresented or deleted in clone libraries and therefore are likely to be refractory to cloning techniques used in the HGP. Our data suggest that these sequence regions are relatively small, contrary to their sizing in the clone-based assembly, and thus amenable to standard PCR amplification and sequencing by 454 (see Additional data file 2 for size estimations based on other human and primate assemblies).