Heterochromatin is a major component of metazoan and plant genomes (e.g., ~20% of the human genome) that regulates chromosome segregation, nuclear organization, and gene expression (1). A thorough description of the sequence and organization of heterochromatin is necessary for understanding the essential functions encoded within this region of the genome. However, difficulties in cloning, mapping, and assembling regions rich in repetitive elements have hindered the genomic analysis of heterochromatin (5). The fruit fly Drosophila melanogaster is a model for heterochromatin studies. About one-third of the genome is considered heterochromatic and is concentrated in the pericentromeric and telomeric regions of the chromosomes (X, 2, 3, 4, and Y) (5). The heterochromatin contains tandemly repeated simple sequences (including satellite DNAs) (9), middle repetitive elements [such as transposable elements (TEs) and ribosomal DNA], and some single-copy DNA (10).
The whole-genome shotgun sequence (WGS3) was the foundation for finishing and mapping heterochromatic sequences and for elucidating the organization and composition of the nonsatellite DNA in Drosophila. WGS3 is an excellent assembly of the Drosophila euchromatic sequence, but it has lower contiguity and quality in the repeat-rich heterochromatin. We undertook a retrospective analysis of these WGS3 scaffolds (11). Moderately repetitive sequences, such as transposable elements, are well represented in WGS clones and sequence reads, but they tend to be assembled into shorter scaffolds with many gaps and low-quality regions because of the difficulty of accurately assigning data to a specific copy of a repeat. The typical WGS heterochromatic scaffold is smaller [for scaffolds mapped to an arm, N50 ranged from 4 to 35 kb (11)] than a typical WGS euchromatic scaffold (N50 = 13.9 Mb) (5). Relative to the euchromatic scaffolds, the WGS3 heterochromatic scaffolds have 5.8 times as many sequence gaps per Mb, as well as lower sequence quality.
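The two contiguity measures used in this comparison, N50 and sequence gaps per Mb, can be illustrated with a minimal Python sketch. It assumes the common definition of N50 (the scaffold length at which scaffolds of that length or longer contain at least half of the assembled bases); the function names and the example numbers are hypothetical and are not taken from the Release 5 data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    Assumed definition: the length L such that scaffolds of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def gaps_per_mb(n_gaps, total_bp):
    """Return the number of sequence gaps per megabase of sequence."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return n_gaps / (total_bp / 1_000_000)


if __name__ == "__main__":
    toy_scaffolds = [10_000, 8_000, 6_000, 4_000, 2_000]  # illustrative only
    print(n50(toy_scaffolds))
    print(gaps_per_mb(12, 2_000_000))
```

Because N50 is weighted by bases rather than by scaffold count, a few long scaffolds can raise it substantially even when many short scaffolds remain.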
To produce the Release 5 sequence, we identified a set of 10-kb genomic clones from a library representing 15× clone coverage by paired end reads (mate pairs) and used this set as templates to fill small gaps and improve low-quality regions (11). Higher-level sequence assembly into Mb-sized linked scaffolds used relationships determined from bacterial artificial chromosome (BAC)–based sequence tag site (STS) physical mapping (see below) and BAC end sequences. In addition to the WGS data, we incorporated data from 30 BACs (3.4 Mb; 15 BACs finished since Release 3) that were originally sequenced as part of the euchromatin sequencing effort (5).
Sequence finishing resulted in fewer gaps, longer scaffolds, and higher-quality sequence relative to WGS3 (fig. S1). About 15 Mb of this sequence has been finished or improved, and 50% of the sequence is now in scaffolds greater than 378 kb (N50). Table 1 summarizes the Release 5 sequence statistics by chromosome arm. Improved sequence was generated for 145 WGS3 scaffolds, and a set of 90 new scaffolds was produced by joining or filling 694 gaps of previously unknown size between WGS3 scaffolds. The relationships between the initial WGS scaffolds and the Release 5 scaffolds can be complex (Fig. 1 and figs. S2 to S7); for example, there were eight cases in which small scaffolds were used to fill gaps within larger scaffolds, and two scaffolds whose gaps interdigitated. As expected, the sequence consists largely of nests of fragmented TEs, and most remaining gaps are bounded by TEs or simple sequence repeats, including simple repeats not previously described. The quality of the improved sequence was measured by calculating the estimated error rates within 10-kb sliding windows (overlapping by 5 kb) on the consensus sequences (11). For all but 11 of 1832 10-kb regions not overlapping one of the known TEs, the estimated error rate is less than 1 per 17,986 base pairs (bp), well below the accepted standard for finished genomic sequence of 1 error per 10,000 bp.
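The sliding-window quality measurement can be sketched as follows. This is an illustration under stated assumptions, not the procedure described in (11): it assumes that each consensus base carries a Phred-style quality score Q, with error probability 10^(-Q/10), and that the estimated error rate of a window is the mean per-base error probability. The function names are hypothetical; the window size (10 kb), step (5 kb), and the threshold of 1 error per 10,000 bp follow the text.

```python
def phred_to_error(q):
    """Convert a Phred-style quality score to an error probability (assumed model)."""
    return 10 ** (-q / 10)


def window_error_rates(quals, window=10_000, step=5_000):
    """Return (start, end, mean_error_probability) for overlapping windows.

    `quals` is a sequence of per-base quality scores for one consensus
    sequence. The final window is truncated at the end of the sequence.
    """
    n = len(quals)
    probs = [phred_to_error(q) for q in quals]
    results = []
    start = 0
    while start < n:
        end = min(start + window, n)
        chunk = probs[start:end]
        results.append((start, end, sum(chunk) / len(chunk)))
        if end == n:
            break
        start += step
    return results


def meets_finished_standard(rate, max_rate=1 / 10_000):
    """True if the estimated error rate is at or below 1 error per 10,000 bp."""
    return rate <= max_rate


if __name__ == "__main__":
    toy_quals = [50] * 20_000  # illustrative scores, not real data
    for start, end, rate in window_error_rates(toy_quals):
        print(start, end, rate, meets_finished_standard(rate))
```

Overlapping the windows by half their length means that a short low-quality stretch falling on a window boundary is still captured in full by a neighboring window.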
Table 1 Status of Release 5. Sequence statistics for the chromosomes are divided into regions contiguous with the euchromatic arm sequences (e.g., Xh) and regions mapped cytologically to those chromosome arms but not currently connected (e.g., XHet). Bac-Based ...
Fig. 1 Comparison of WGS scaffolds to the corresponding Release 5 scaffold. WGS scaffolds (gray, same orientation; tan, opposite orientation) are diagrammed above the Release 5 scaffold (blue). Sequence gaps (thin horizontal lines) in WGS scaffolds are indicated.
Fig. 2 Sequenced regions of D. melanogaster pericentromeric heterochromatin. The heterochromatin extends proximally from the euchromatin (black) and includes sequenced and assembled regions (aqua) and unsequenced regions (gray). The actual gap sizes between ...
Concurrent with the sequence-finishing effort, we constructed an integrated physical and cytogenetic map to describe the overall structure of the pericentromeric heterochromatin. This map was essential for ordering, orienting, and linking WGS sequence scaffolds into larger BAC contigs and Release 5 scaffolds. Heterochromatic sequences at the centric ends of the Release 3 arm sequences were represented in BAC-based physical maps of the euchromatic and telomeric portions of the chromosomes (12), but most heterochromatic scaffolds had not been mapped in large-insert clones or localized to specific sites on the chromosomes.
BAC-based STS content mapping of WGS3 scaffolds, using 354 probes designed from genomic sequence and five BAC libraries (11), extended and linked many scaffolds into larger BAC contigs. The BAC map incorporates scaffolds spanning 13.4 Mb of the WGS3 assembly and links 14 WGS3 scaffolds to the Release 3 arm sequences. In regions proximal to the arm assemblies, it links 130 WGS3 scaffolds into 25 multiscaffold BAC contigs and yields 21 single-scaffold BAC contigs (11). The largest BAC contig links 20 WGS3 scaffolds spanning 1.7 Mb.
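The way STS content mapping groups scaffolds into multiscaffold and single-scaffold BAC contigs can be sketched with a small union-find example. This is a simplification under stated assumptions: two scaffolds are treated as linked whenever probes designed from both hit the same BAC, and a scaffold that shares no BAC with any other scaffold forms a single-scaffold contig. The function name, input layout, and identifiers are hypothetical, and the sketch ignores the order and orientation of scaffolds within a contig.

```python
from collections import defaultdict


def link_scaffolds(bac_hits):
    """Group scaffolds into contigs from STS probe hits.

    `bac_hits` maps a BAC identifier to the scaffold identifiers whose
    probes hybridized to that BAC. Returns a list of contigs, each a
    sorted list of scaffold identifiers, largest contig first.
    """
    parent = {}

    def find(x):
        # Find the representative of x, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for scaffolds in bac_hits.values():
        scaffolds = list(scaffolds)
        for scaffold in scaffolds:
            find(scaffold)  # register scaffolds hit by only one BAC
        for other in scaffolds[1:]:
            union(scaffolds[0], other)

    groups = defaultdict(set)
    for scaffold in list(parent):
        groups[find(scaffold)].add(scaffold)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


if __name__ == "__main__":
    toy_hits = {  # illustrative identifiers only
        "BAC1": ["s1", "s2"],
        "BAC2": ["s2", "s3"],
        "BAC3": ["s4"],
    }
    print(link_scaffolds(toy_hits))
```

In this toy input, scaffold s2 is hit on two BACs, so it bridges s1 and s3 into one multiscaffold contig, while s4 remains a single-scaffold contig.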
Summary of the integrated physical and cytogenetic map assembly. N/A, not applicable.
We used fluorescence in situ hybridization (FISH) to map BAC contigs and sequence scaffolds to specific cytogenetic locations in mitotic chromosomes (11). The high repeat content of heterochromatin required the use of single-copy probes [P-element insertions (15) and cDNA clones (17)] that could be assigned to specific sequence scaffolds. We also used BAC probes that had sufficient single-copy sequences to provide unambiguous localizations (11) (fig. S8). The physical and cytogenetic mapping results and previously published data were used to produce an integrated map of pericentromeric heterochromatin (11). We present cytogenetic locations for 15 BAC contigs linking 80 scaffolds and an additional 14 scaffolds that were linked to chromosome arms; these localized scaffolds span 11.2 Mb of pericentromeric heterochromatin in the WGS3 assembly. Currently unlocalized are 50 WGS3 scaffolds in 31 BAC contigs, as well as an additional 63 WGS3 scaffolds larger than 15 kb that are not represented in the BAC map. Four scaffolds larger than 15 kb and not represented in the BAC map were incorporated into Release 5 by sequence finishing (11).
Integration of the map and sequence-finishing information led us to define three classes of Release 5 heterochromatic scaffolds: (i) contiguous with the assembled euchromatic arms and extending them farther into pericentromeric heterochromatin (chromosome arm “h”); (ii) mapped to specific chromosome arms with partial information on order and orientation and concatenated into “arm” files (arm “Het”); and (iii) unmapped and concatenated into a single file (arm “U”). The improved, mapped Release 5 scaffolds are diagrammed relative to the chromosome arms; see (11) for analysis of sequences and maps by chromosome.
Fig. 3 Integrated map of D. melanogaster pericentromeric heterochromatin. The cytogenetic reference map of the heterochromatic regions of the chromosomes with numbered divisions (h1 to h58) and centromeres (C) is shown (22). The fourth chromosome (h58 to h61) ...
We have demonstrated substantial progress toward our goal of assembling and mapping the components of heterochromatin that are not simple repeats, and have shown that heterochromatic regions containing single-copy genes and a high density of transposable elements can be assembled into high-quality, contiguous sequence. How can we generate an even more complete genomic understanding of Drosophila heterochromatin? The tiling path of overlapping BACs spanning the Release 5 sequence (11) provides templates for gap closure and scaffold extension in the regions that contain middle-repetitive elements and single-copy genes. Progress can also be made in localizing more sequences by performing FISH with additional cDNAs, BACs, and transposon insertions from other collections (19). Restriction fingerprints of tiling path BACs will also provide an independent benchmark to evaluate the accuracy of finished sequence assemblies (21). The apparent absence of BACs covering various remaining gaps likely reflects the presence of extensive simple sequence arrays, which are unlikely to be completely closed as the map and sequence are improved. New technologies will be required to determine the sequence and structure of these highly repetitive regions. However, an achievable goal using current technologies is to produce complete maps and sequence assemblies for the single-copy and middle-repetitive components of the heterochromatin, combined with cytological definition of the locations and structures of large blocks of tandemly repeated simple-sequence DNA.
Our results suggest that elucidating the organization and composition of heterochromatic regions in other organisms is a practical goal. However, our ability to substantially improve the sequence and maps required three critical components: (i) a high-quality WGS sequence assembly; (ii) a high-depth collection of precisely sized and aligned genomic clones for sequence finishing and gap closure; and (iii) physical and cytogenetic mapping to deduce relationships between WGS scaffolds. The STS content-mapping experiments benefited greatly from the availability of large-insert BAC libraries produced by fragmenting genomic DNA with three different restriction enzymes and with physical shearing. Analysis of heterochromatin in other genomes would also benefit from improved algorithms that can successfully and accurately assemble sequence of regions rich in repeated DNA.