Overview of the finishing strategy
Both of the sequencing centers that participated in the finishing effort, Lawrence Berkeley National Laboratory (LBNL) and the Human Genome Sequencing Center, Baylor College of Medicine (HGSC), have extensive experience in finishing the sequence of individual BAC clones. To take advantage of this expertise, we reduced the problem of finishing a WGS sequence to one of finishing a set of overlapping BAC clones. This approach allowed us to use the sequence assembly software phrap [10
], which provided an independent assembly of the raw trace data and an estimated error rate. Thus, apart from filling gaps and improving sequence quality, we were able to compare two independent assemblies of the same trace data. LBNL took responsibility for chromosome arms 2L, 2R, 3R, 4 and the proximal half of the X chromosome (80.6 Mb), and HGSC chromosome arm 3L and the distal half of the X chromosome (36.4 Mb).
We previously generated a physical map of the major autosomes [11
]. BAC-based maps of the X chromosome [12
] and chromosome 4 [13
] were also available (Figure ). We selected a tiling path of BAC clones spanning the physical maps and Release 2 sequence using the available sequence-tagged site (STS) maps and BAC end sequences. The WGS sequence traces and end sequences of the tiling path BACs were assigned coordinates based on their order in the Release 2 assembly. We then binned all the traces belonging within a particular BAC and extending 500 bp beyond the end sequence of that BAC. These traces were combined with sequence traces (1.5 × sequence coverage) that had been generated from libraries made from the individual tiling-path BACs [1
] and assembled using phrap. These assemblies formed the starting point for our finishing efforts.
Figure 1 Status of the Drosophila melanogaster euchromatic genome. Each chromosome arm is represented by a black horizontal line with a circle indicating its centromere. For each arm, seven tiers of information (A-G) are presented. (A) Each vertical green line (more ...)
To generate Release 3, we closed gaps by sequencing 518 3-kb subclones, 565 10-kb clones and 290 fragments generated by PCR; we improved quality with 15,344 custom primer sequence runs. To assess the accuracy of the individual BAC assemblies, we compared in silico
restriction maps generated from the sequence of each BAC to restriction fingerprints essentially as described [14
]. A set of Bam
RI and Hind
III digests was used to verify the assembly of 618 BACs. For 328 BACs, only two of the three enzyme digests were available to verify the assembly. The finished BAC clone sequences were submitted individually to GenBank.
The assembly of the individual BACs into entire chromosome arms was verified in two ways. First, BAC tiling path clones were mapped by in situ
hybridization to the polytene salivary gland chromosomes of third instar larvae, which serves as an unambiguously correct physical map. In situ
hybridization data for 547 clones in the BAC tiling path have been reported previously [11
]. Images documenting the localization of 915 (96%) BACs in the tiling path can be found in the GadFly Genome Annotation Database [15
]. Second, the unique end sequences of 8,424 BACs were aligned to the assembled chromosome arm sequences. If the fully sequenced BACs were not assembled in the correct order and orientation, we would expect to find regions of the genome where those end sequences fail to align to positions between 100 and 200 kb apart; such regions of misaligned paired BAC end sequence were not observed in the final Release 3 chromosome arm assemblies. The Release 3 chromosome arms are composed of 13 sequence scaffolds and contain 31 sequence gaps. A sequence scaffold is defined as a set of contiguous sequence contigs, ordered and oriented with respect to one another; within a scaffold, the gaps between adjacent contigs are of known size and are spanned by clones [1
]. Physical map gaps are gaps between scaffolds; in these cases, no clones that span the gap have been identified.
Refinements to the physical map
A BAC-based physical map of the X chromosome was constructed by the European Drosophila Genome Project (EDGP) [12
] and used as a starting point for BAC-based sequence assembly of the X. In our finishing work, we replaced mismapped BACs at 13A and 14D. We also filled nine clone gaps in the physical map, identifying BACs that span them by comparing the paired end sequences of 9,869 BACs [1
] with the Release 2 sequence assembly. In Release 3, the X chromosome is in three scaffolds of 10.5, 10.8 and 0.4 Mb, and has two physical map gaps at 9EF and 20A, estimated to be 150 kb and 200 kb in size, respectively (Table ). Several short tandem arrays lie within 3 kb of the 9EF gap, but it possesses no other remarkable sequence features. Complex nests of transposable elements flank the gap at 20A, which lies within the centric heterochromatin; such nests of transposable elements are common in the proximal regions of the chromosome arms [9
Our BAC-based physical map of the autosomes [11
] was used as a starting point for sequence assembly of the second chromosome. The left arm of chromosome 2 (2L) is in two scaffolds of 21.7 and 0.8 Mb and has one clone gap at 39D-E that contains 100-200 tandem repeats of the histone gene cluster [16
]. Each repeat contains five histone genes: Histone H1 (His1), Histone H3 (His3), Histone H4 (His4), Histone H2A (His2A)
and Histone H2B (His2B)
. Two predominant forms of the repeat, 4.8 and 5.0 kb, have been previously described and differ primarily in the size of the spacer DNA between His1
. The variants are not interspersed with one another; but form segregated clusters. The BAC at the distal side of the gap has at least three copies of the 5-kb variant and the BAC at the proximal side of the gap has at least two copies of the 4.8-kb variant.
The right arm of chromosome 2 (2R) is in three scaffolds of 1.5 Mb, 14.3 Mb and 4.5 Mb, with two clone gaps, one in 42B and another in 57B, estimated to be 100 kb and 300 kb, respectively (Table ). The clone gap in 42B is associated with 50 to 100 copies of a previously uncharacterized 596 bp repeat showing weak similarity to RNA-directed RNA polymerase, interspersed with 1-kb units that have 49% nucleotide similarity to the 1731
transposable element. The other clone gap on 2R, in 57B, is flanked on the distal side by 120-bp tandem repeats with similarity to snoRNAs [17
], and on the proximal side by a different 120-bp repeat.
The left arm of chromosome 3 (3L) is in two scaffolds of 5 Mb and 18.5 Mb, with a single clone gap at 64 C, estimated to be 100 kb (Table ). The right arm of chromosome 3 (3R) has no euchromatic clone gaps and is represented in a single sequence contig of 27.9 Mb.
A BAC-based physical map of chromosome 4 [13
] was used as a starting point for sequence assembly of this arm. This map has two clone gaps, one at 102A and the other at 102EF. The gap at 102A was estimated to be very small and was filled with 46,766 bp of sequence in Release 3. The gap at 102EF was estimated to be 100 kb and remains the only clone gap on chromosome 4 in Release 3. In Release 3, chromosome 4 exists in two scaffolds 1.2 Mb and 0.03 Mb. The gene CG17467
extends into the gap from the proximal side, and the CG18026 (CAPS)
gene extends into the gap from the distal side; these sequences can be used as starting points to fill the gap for Release 4 of the genome sequence. The CAPS gene is found in a 64-kb scaffold in WGS3 that extends 28.5 kb into the gap, ending in approximately 2 kb of the simple 9-bp repeat CATAATAAT.
Assessment of the quality of Release 3
The total size of the euchromatic portion of the Drosophila genome in Release 3 is 116,914,271 bp. The status of each chromosome arm, including length, number of physical map gaps, number of finished and unfinished BACs, number of sequence gaps, and sequence quality, estimated as error rate in a sliding window of 100 kb, is shown in Table . The positions of the remaining gaps are diagrammed in Figure . The euchromatic genome is assembled into 50 contigs; 50% of the genome is contained in contigs (N50) of 14 Mb or greater, and 99% (N99) is in contigs greater than 526 kb. A total of seven physical map gaps and 37 sequence gaps remain (Table ). The sequence is highly accurate: 98.7% of the base pairs are contained within 100-kb regions having estimated error rates of less than one per 100,000 bp. The estimated error rate was also determined for each of the 950 BACs that comprise the tiling path. All have a phrap estimated error rate of less than one in 30,000 bp, and 875 (92%) are estimated to have less than one error in 100,000 bp.
Extending the sequence
Our finishing efforts made modest extensions toward the telomere and into the euchromatin-heterochromatin boundary region at the base of each chromosome arm. We can recognize the telomeres by their distinctive sequence features: variable numbers of HeT-A
transposable elements at their extreme ends, followed proximally by variable numbers and types of Taq minisatellite repeats (TAS elements [18
]). The transition from euchromatin to heterochromatin is gradual. The boundary between euchromatin and centric heterochromatin has been defined cytogenetically [19
], but at the molecular level it is simply characterized by a significant increase in the density of transposable elements [9
] and other repetitive DNA sequences. We report in this paper on that portion of the genome contained in the mapped scaffolds of Release 2, with the exception of four small scaffolds on chromosome 2 that we know lie in heterochromatin; these regions include all of the euchromatin and extend into heterochromatin, as defined cytogenetically [5
The sequences found at the telomeric and centromere proximal ends of each chromosome arm in Release 3 are as follows.
The sequence at the telomere of the X chromosome in Release 3 is contained in BACR13J02, but this BAC is not completely assembled and does not extend to the TAS repeats. The Release 3 sequence of the X chromosome extends proximally approximately 10 kb beyond Release 2 toward the centromere. The proximal 38 kb of Release 3 are not covered by BAC clones; we used 10-kb clones from the WGS in order to produce this sequence. The WGS3 assembly extends beyond Release 3 by 42 kb, ending in 235 copies of the simple repeat TAGA, a known heterochromatic satellite repeat [20
The Release 3 telomeric sequence, extending approximately 5 kb farther than that of Release 2, ends in 11 copies of the TAS repeat. The proximal sequence of chromosome arm 2L ends in 15 kb of nested transposable elements.
The Release 3 telomeric sequence ends in two copies of a 235-bp repeat similar to the previously described telomeric sequence in the TAS element. The proximal sequence of chromosome arm 2R is highly enriched in transposable element sequence.
There has been no change in the sequence of the 3L telomere in Release 3 compared to Release 2. It has a single copy of a 458-bp sequence with 98% similarity to the minisatellite TAS repeats of 2L. The proximal sequence of chromosome arm 3L extends beyond Release 2 by 90 kb and is highly enriched in transposable element sequence.
The telomere of chromosome arm 3R in Release 3 contains six copies of a 983-bp repeat similar to the previously described telomeric sequence in the TAS element; it extends the Release 2 sequence by approximately 6 kb. We did not substantially extend the proximal sequence of chromosome arm 3R in Release 3. The WGS3 assembly extends 774 bp beyond Release 3 and ends in 67 tandem copies of a nearly perfect simple repeat TATAA, a known heterochromatic satellite repeat.
The most distal BAC on chromosome 4, BACH59K20, is not anchored to the 1.2-Mb main scaffold. This BAC contains over 32-kb of sequence not mapped to chromosome 4 in Release 2. Analysis of the sequence of this BAC shows that it should have been appended in inverted orientation, because it ends in two copies of an approximately 700 bp portion of a HeT-A element, whose poly(A) tail is oriented toward the centromere. The sequence of chromosome arm 4 in Release 3 extends proximally to that of Release 2 by nearly 60 kb. This extension contains four 8 kb tandem imperfect repeats, each bounded by a narep 1 repeat. The distal repeat contains the plexin B gene; the three proximal repeats contain related protein-coding genes CG32010, CG32011 and CG32009.
Comparing Releases 2 and 3
The Release 2 sequence of the euchromatin contained 1,107 gaps of which we were aware. These 'declared' gaps were represented in the sequence by a string of Ns corresponding in length to the estimated gap size, or by 1,000 Ns if we were unable to estimate the size of the gap. Our comparisons of Releases 2 and 3 identified 424 additional insertions or deletions greater than 20 bp, which we refer to as 'undeclared' sequence gaps (Table , Figure ). Now that more than 97% of the gaps in the euchromatic sequence have been filled, we are able to examine the sequences that were missing in both the declared and undeclared gaps. The sequences missing in each gap were classified into seven categories: unique sequence, transposable element, homopolymer, simple repeat, tandem duplications, local misassembly, or gross misassembly. In addition, we identified individual base-pair differences, as well as insertions and deletions of less than 20 bp.
Sequence content of gaps in Release 2
In the WGS assemblies, the sequence coverage of the X and Y chromosomes is expected to be less than for the autosomes. The autosomal coverage was estimated to be 12x [1
]; thus, sequence coverage is expected to be 9× for the X chromosome, and 3x for the Y, assuming an equal number of male and female embryos in the collection used to make the DNA for the WGS. Presumably as a result of its reduced sequence coverage, the Release 2 sequence of the X chromosome has 2.5 times as many gaps (743, representing 49% of the total) per million bases as those of the autosomes and more than half the gaps in the euchromatic portion of the genome that lie in unique sequence map to the X chromosome. The Y chromosome is almost entirely heterochromatic and Y chromosome sequences are limited to small WGS scaffolds [21
Repetitive elements are difficult to assemble in a whole-genome shotgun strategy. As a consequence, one of the initial steps in the Release 2 assembly was to identify known repetitive elements, including transposable elements, and remove their traces from the early steps of the assembly process [2
]. After the unique sequence was assembled and sequence contigs were ordered and oriented, those transposable element sequence traces were added back. Sequence traces belonging to transposable element families were assembled using an aggressive strategy that resulted in composite sequences for most of the transposable elements in Release 2. Part of the finishing effort has been to determine the sequence of each individual element. Of the 1,572 full and partial transposable elements in Release 3 [9
], only 380 are identical to the corresponding Release 2 transposable element sequences. Of these identical elements, 323 are shorter than 900 bp and could be assembled with two reads if those reads contained sufficient unique sequence to be unambiguously placed. The sequences of the remaining 57 transposable elements are identical between Release 2 and Release 3 because their sequence was determined by the Berkeley Drosophila Genome Project (BDGP) as part of the finishing process leading to Release 2 [1
]. Approximately a third of the total number of gaps in Release 2 euchromatin are the consequence of transposable elements. A slightly higher fraction of the gaps are the consequence of simple repeats.
Duplications in the Drosophila
genome are rare in comparison to the number found in mammalian genomes. We observed tandem repeats whose repeat units range in size from 10 bp to 30 kb. The largest region comprised of tandemly duplicated repeats in the euchromatic portion of the Drosophila
genome is the histone cluster at 39B described above. As a prelude to a more rigorous analysis of repeats, we have searched for perfect direct or inverted repeats over 100 bp in length that are not transposable elements or low-complexity sequence. We identified 0.9 Mb of repeated sequence (less than 1% of the euchromatic genome), 40% mapping to chromosome X, 40% to chromosome 2, 15% to chromosome 3 and 3% to chromosome 4. Although there are more repeats on chromosomes X and 2, they appear to be randomly distributed and are not clustered at the centromeres or telomeres. This is in contrast to the increased number of tandem duplications found in the pericentric regions of mammalian chromosomes [22
]. Misassemblies of tandem repeats account for only 5% of all gaps. Local misassemblies resulting in small insertions, the result of low-quality sequence in Release 2, are also rare, constituting only 2% of the gaps.
We identified only seven gross misassemblies in Release 2. Six resulted in large sequence insertions or deletions in Release 2, and one resulted in a large inversion. In addition, a polymorphism in the isogenic strain resulted in an eighth large-scale sequence difference between Releases 2 and 3. This polymorphic tandem duplication consists of more than 30 kb of sequence bounded by hobo and cruiser transposable elements; half of the BACs mapping to this location contain one copy of the repeat and the other half have two. Transposable element sequence is associated with the ends of all of the misassembled regions. On the X chromosome, at position 3.4 Mb, 88 kb of complex transposable element sequence in Release 2 has been replaced in Release 3 with a single copy of a roo element and a gap. When complete, this region is expected to contain three roo elements. On 2L there is one misassembly, at 1.8 Mb. A declared gap estimated to be 1,400 bp and involving an mdg3 transposable element was filled in Release 3 with 50 kb of sequence. On 2R there are three misassemblies, mapping to 0.8 Mb, 14.8 Mb and 20.1 Mb. The first is associated with four declared gaps within 135 kb of Release 2 sequence. That sequence has been replaced with 60 kb of sequence in Release 3, bounded by a G element at one end and an invader 3 element at the other. The second misassembly is a declared gap, estimated to be 11 kb, that was filled with 37 kb of sequence in Release 3. In the third, 54 kb of complex transposable element sequence in Release 2 has been replaced with a single complete 10-kb roo element in Release 3. On 3R, a 6-kb segment in Release 2 is replaced in Release 3 with 33 kb. The Release 2 sequence of chromosome 4 contains a large misassembly that inverts the distal third of the chromosome (John Locke, personal communication). The inversion occurred at position 751,419, the location of a portion of a 1360 element. The misassembled Release 2 sequence ends with a TAATAATA(27) repeat that maps starting at position 817,409 bp in Release 3.
We identified single base substitutions and single nucleotide insertions and deletions between Release 2 and Release 3. Excepting transposable elements, the rate of base changes is one per 22,000 bp for the autosomes and one per 5,000 bp for the X chromosome. Within transposable elements, the rate of base changes is one per 124 bp; this high error rate is largely a consequence of comparing the Release 3 sequence of individual elements to the composite sequences in Release 2.
Comparison of WGS assemblies to Release 3
In addition to comparing the Release 3 sequence to the two sequence releases previously deposited in GenBank, we also compared Release 3 to each of three pure WGS assemblies. These comparisons allow an assessment of how well a WGS approach works on a 180 Mb metazoan genome. Although the specific software we used is proprietary, the underlying algorithms and the logic have been fully described [2
] and the basic approach replicated by others [23
]. An analysis of these assemblies in light of the modifications to the algorithmic strategy made between assemblies should inform the development of WGS assembly software in general. The sequence traces used in these assemblies are available from the GenBank trace repository and Release 3 offers a high-quality sequence for comparison, providing an important resource for testing new WGS assembly algorithms. The sequences that comprise the WGS2 and WGS3 assemblies are also available for benchmarking.
Three assemblies of the WGS data were made, each using a different version of the Celera Genomics assembly software. WGS1 used 3,156,000 paired-end shotgun reads from genomic plasmids and 12,152 paired BAC-end reads. An additional 62,000 reads that had been erroneously removed during the quality screening process for WGS1 were added to the input data set for the next two assemblies (WGS2 and WGS3). The assembly algorithm that produced WGS1 did not adequately handle repeat resolution; a variety of improvements, including a different method of placing repeat traces, better identification of repeat traces, and the introduction of a sequence correction algorithm, resulted in better subsequent assemblies, particularly of repetitive regions.
The output of the assembly process is a collection of scaffolds made up of ordered and oriented contigs linked together by paired end sequences. Paired-end reads of clones whose size falls within a tight distribution allow the length of every gap between two contigs of a scaffold to be characterized with an expected mean and standard deviation. There were 732,000 pairs of sequences produced from the ends of 2-kb plasmid libraries, another 548,000 pairs of sequences from the ends of 10-kb plasmid libraries, and 12,152 pairs of BAC-end sequences a mean distance of 130 kb apart from each other. Table presents statistics characterizing the scaffolds in the three assemblies. Table presents the same statistics for the scaffolds that align to the Release 3 euchromatic sequences. The scaffolds that do not align to the chromosome arms, totaling 16.5 Mb of sequence and spanning 20.7 Mb, derive from the heterochromatic regions of the genome (see [5
] for more information on the content of these scaffolds.)
Scaffold, contig and gap statistics for the three assemblies
WGS scaffolds that align to the euchromatic portion of Release 3
The output of the assembler contains only the sequence reads that assemble with high confidence into a contig. As a result, 15-25% of the sequencing reads, most of which come from heterochromatin or interspersed repetitive elements, do not form part of an assembly. Table provides evidence that the total amount of data that was reliably assembled increased as the assembly algorithms improved. In the later assemblies, more sequence from heterochromatic regions of the genome is reported, but in a more fractured state than for euchromatic regions. These additional assembled pieces, although constituting a modest fraction of the sequence, constitute a large percentage of the number of scaffolds and contigs. The 10 largest scaffolds of the three assemblies constitute over 80% of the sequence and are almost identical.
In all three assemblies, a small number of large scaffolds cover the 116.9 Mb of the Release 3 euchromatic sequence (Table ). WGS3 scaffolds span 99.91% of Release 3 euchromatic sequence and extend beyond it by 0.84 Mb. Mean gap length increased between WGS2 and WGS3, but there are half as many gaps, indicating that WGS3 resolved many of the smaller interspersed repeats. Scaffold N50 lengths are similar in each assembly (Table ), as all of the mapped scaffolds are large. Of the 116.9 Mb of Release 3 euchromatic sequence, WGS1 provided 97.6%, WGS2 provided 98.4%, and WGS3 provided 98.9% of the sequence.
The content of sequence gaps in WGS1, WGS2 and WGS3 was determined by comparison to Release 3. The majority of this sequence - 73%, 64%, and 70%, respectively - is repetitive, either tandem repeats or transposable elements.
We compared the order of the Release 3 euchromatic sequences with the order of the contigs and scaffolds in each of the WGS assemblies (see Materials and methods). Assuming that Release 3 is correct, we manually examined and categorized these discrepancies. Table shows the number of incorrectly ordered segments in each category and the total number of base pairs involved in those segments. Occasionally, what look like separate scaffolds actually overlap; a contig at the end of one scaffold contains sequence that belongs in a gap at the end of the other scaffold. These interleaved scaffolds induce a subtype of local order error, as seen in Table . In later versions of the assembly, the number of interleaving failures actually increased. Until now, all the algorithmic solutions have simply considered scaffolds to be non-overlapping, and have ignored this phenomenon. The order and orientation of the scaffolds themselves is correct, so these are considered separately from local errors.
Order and orientation errors in the WGS assemblies compared to Release 3
We measured the base-pair accuracy of the sequence in the WGS assemblies. We found 46,722 discrepancies in WGS1, 26,355 in WGS2, and 13,095 in WGS3 corresponding to 3.99, 2.20, and 1.09 errors per 10,000 bp, respectively. The National Human Genome Research Institute (NHGRI) standard for finished sequence is less than one error per 10,000 bp. With the exception of gaps, WGS3 nearly meets this standard for finished sequence over the entire assembled sequence. However, most of the errors are in the reconstruction of repetitive sequence (Table ); in the unique sequence, all three assemblies exceed the error rate standard.
Sequence error rates for the WGS assemblies
Furthermore, assemblies tend to be less accurate at the tips of contigs, because the number of reads covering the ends is much lower than in the middle of the contig. As shown in Table , sequence quality rises when 10-50 bp are eliminated from the ends of each contig. Though the WGS sequences contain more gaps than would be considered acceptable for 'finished' sequence, the sequences that are present are highly accurate.