Sequence data sets
The nature of our comparative genome mapping and sequencing program [
6,
11] has required us to routinely sequence the DNA from many vertebrates. Our data and the anecdotal experiences of others have consistently suggested marked differences among species with respect to the effort required to refine (i.e., finish) shotgun-generated sequence assemblies, even for the same orthologous genomic region. To investigate these differences in a more systematic fashion, we generated and analyzed two comparable sequence data sets.
The first data set (Table ) consists of the human-grade finished sequences generated for 541 BACs from 38 vertebrates, with all sequences orthologous to a 1.9-Mb genomic region (in human) encompassing the cystic fibrosis transmembrane conductance regulator gene (
CFTR); this genomic segment corresponds to ENCODE pilot project region ENm001 [
22,
23]. The number of sequenced BACs and the total amount of sequence in the ENm001-derived multiple-BAC sequence assembly generated for each vertebrate are provided in Table . A species' sequence was included in this data set if PipMaker-derived alignments [
24] of that species' BAC sequences together covered at least 75% of the human ENm001 reference sequence. Importantly, comparative-grade finished sequence was also available for all 541 sequenced BACs in this first data set.
| Table 1Human-grade finished BAC sequences from ENm001. |
The second data set consists of the comparative-grade finished sequences generated for 2,031 BACs from 21 vertebrate species (Table ), with each sequence orthologous to one of 14 ENCODE pilot project regions (Table ). The number of sequenced BACs and the total amount of sequence in the individual BAC clone sequences generated for each species (Table ) or ENCODE region (Table ) are listed. To be included in this data set, more than 75% of each of the ENCODE pilot project regions (Table ) needed to be covered by PipMaker-derived alignments of comparative-grade finished sequence from that species (Table ). Most of the BAC sequences in this second data set have not been refined to human-grade finished sequence. Because of insufficient sequence coverage, BAC sequences were not included in this second data set from three other ENCODE pilot project regions (ENr231, ENr232, and ENr333) that have unusual characteristics and were thus analyzed for illustrative purposes (e.g., see below); in these cases, the number of sequenced BACs (and total Mb of individual BAC clone sequences) for ENr231, EN232, and ENr333 were 64 (10.3 Mb), 51 (8.0 Mb), and 62 (10.0 Mb), respectively.
| Table 2Comparative-grade finished BAC sequences listed by species. |
| Table 3Comparative-grade finished BAC sequences listed by ENCODE region. |
Misassembled sequence
In shotgun sequencing, once a sufficient number of sequence reads are collected based on the size of the starting template (e.g., a BAC), the sequence is assembled computationally (such as with the program Phrap [
25,
26]). Even though the assembly program is operated under conditions we have optimized for our BAC sequences, occasionally an incorrect alignment occurs with closely related but distinct reads, resulting in a misassembled and incorrect consensus sequence. Misassemblies can result from various causes, such as the presence of multiple copies of closely-related repetitive sequences that inappropriately collapse onto one or a few regions. Often, misassemblies incorrectly portray non-adjacent segments as being contiguous. Such problems require manual correction, otherwise the sequence would be of marginal utility (e.g., for accurately identifying genes or characterizing the long-range organization of a genomic region). With the aid of sequence-editing programs, an experienced technician systematically reviews and sorts the misassembled reads, eventually correcting problems and allowing a valid consensus sequence to emerge. However, this manual intervention can be one of the most labor-intensive and costly components of the sequence-finishing process. Details of our sequence-finishing process are provided in the Supplementary Materials associated with Blakesley
et al. [
5].
We have consistently found that certain genomic regions are more prone to sequence misassemblies than others, regardless of the species being studied. For example, considerable effort was required to resolve misassembled sequences for more than 20% of all BACs from ENCODE pilot project regions ENr231 and ENr333; for the other ENCODE regions, less than 5% of the BACs needed such attention. Similarly, certain species stand out as being more prone to sequence misassemblies, regardless of the genomic region being sequenced. For example, baboon BAC sequences typically contain an unusually high frequency of long (2- to 43- kb) inverted and tandem repeats of >90% identity that greatly confound standard sequence-assembly routines. As another example, assembled hedgehog sequences often contained non-contiguous segments that had been incorrectly joined by Phrap. In these cases, more extensive manual effort was required to untangle the misassemblies.
We investigated alternative sequence-assembly programs to Phrap that provide 'read-pair awareness' as a means of reducing the amount of manual correction of the initial sequence assembly required. Of the many programs evaluated, rPhrap (Geospiza) [
27] and Autosort (NFH, unpublished) showed the greatest promise. We thus tested these two sequence-assembly programs by having them assemble the sequence reads generated from 102 BACs that had previously been found to be associated with Phrap-induced sequence misassemblies. Improved sequence assemblies were generated for 68 of these BACs (with complete resolution for 29 BACs and reduction of minor misassemblies for 39 BACs). Direct comparison of the performance of each program revealed that one generally showed more improvement than the other for any particular BAC, but Autosort overall yielded improved assemblies for twice as many BACs as rPhrap. While these findings represent a positive development, it is important to note that these alternative sequence-assembly programs are actually less effective than Phrap for assembling average BAC sequences, due primarily to a great increase in the number of contigs and occasional generation of new misassemblies.
We also tested whether improved sequence assemblies could be generated by separately altering certain Phrap parameters. The shotgun sequence reads for ten of the above BACs were assembled using four different sets of Phrap parameters; each resulting assembly was then analyzed for features we have found to serve as a measure of the manual effort required for sequence finishing (specifically, the total number of contigs, misjoined sequences, uncaptured gaps, contigs which can be manually joined, and groups of reads to remove from contigs). Compared to the assemblies generated with our optimized Phrap parameters, those generated using the 'repeat stringency' set at 99 or 'shatter greedy' were inferior with respect to the above measures, and would have required substantial additional effort to complete the finishing process. Assemblies generated using 'revise greedy' or a 'forcelevel' of 3 (instead of 0) were found to be similar to our standard optimized Phrap assemblies. Based on analyses such as this, we have adopted a routine in which shotgun sequence reads are first assembled with Phrap using our optimized parameters; if problematic misassemblies are noted, the reads are then reassembled with rPhrap and separately with Autosort. The best of the three assemblies is then selected as a better starting point for further refinement, including manual resolution of more complex misassemblies.
Frequency of gaps
When inspecting a sequence assembly, one of the first features to assess is whether the consensus sequence is in a single piece or multiple pieces (i.e., contigs). An assembly consisting of more than one contig indicates that the sequencing procedure failed to generate sufficient sequence reads from the DNA residing between contigs (i.e., gaps). The number of sequence gaps is a first indicator of the amount of effort that will be required to finish an assembled sequence. Early in any sequence-finishing process, each contig must be manually inspected, looking for internal regions of low quality or inappropriately joined segments. As might be anticipated, contig ends typically contain poor-quality sequence, either due to individual sequence reads extending well beyond the average quality length or due to chimeric reads. The amount of effort required to manually inspect and refine contig ends is directly proportional to their number (each gap is associated with two flanking contig ends).
The impact of contig number is most pronounced during the steps involved in establishing contig order and orientation, a key part of the finishing process. One approach involves comparing in silico- and laboratory-generated restriction enzyme digestion-based fingerprints of a BAC. When the assembled sequence is distributed among numerous contigs, in silico-deduced restriction fragments become artificially broken, making it is nearly impossible to derive an accurate restriction enzyme digestion-based fingerprint from the many combinatorial possibilities. A second approach for ordering and orienting contigs involves aligning each contig to an orthologous reference sequence. Once again, interpreting the alignments of multiple small contigs in an assembled sequence is far more difficult than just a few large contigs (especially for sequences from species more distantly related to the reference sequence).
For these reasons, the number of gaps in an assembled sequence is a critical parameter for the sequence-finishing process. We analyzed the first data set for the presence of gaps in the comparative-grade finished sequence of the 541 BACs. Specifically, for each of the 38 species listed in Table , we aligned the comparative-grade finished sequence of each BAC to the corresponding human-grade finished sequence for that species using the program Cross Match [
28]. The resulting alignments were manually examined and corrected when contigs in the comparative-grade finished sequence had an ambiguous endpoint and/or order in the alignment. From the final alignments, we then catalogued the location, size, and sequence composition of all gaps in the comparative-grade finished sequence; further, the number of shotgun-library subclones spanning each gap (as detected by establishing the locations of all sequence reads in the human-grade finished sequence) was also determined. Across the 73.2 Mb of total sequence in the first data set (Table ), we identified 1,528 gaps, corresponding to roughly 23 gaps per Mb. Notably, we found an identical frequency of gaps in an earlier study that involved analyzing the comparative-grade finished sequence of 116 vertebrate BACs [
5]. Those 1,528 gaps together account for 1.1 Mb (or 1.5%) of the total sequence, and range in size from 1 bp to greater than 15 kb, with a median size of 250 bp; roughly a third (506) of the gaps are 500 bp or larger.
Examination of the simple aggregate statistics above fails to reveal the striking differences between individual species with respect to gaps in their assembled genome sequences. For example, the amount of sequence residing within gaps varied more than 15-fold among species (Figure ). For species like black lemur, tetraodon, and guinea pig, gap sizes are relatively small (~ 4 kb per Mb), whereas the gaps in shrew and platypus sequence correspond to an average of 36 and 58 kb per Mb, respectively. In the case of gap frequency, variation is less pronounced, i.e., less than fourfold for all but three of the 38 species; in fact, gap frequency ranged from 17 to 23 per Mb for over half of the species, near the overall median value of 20 gaps per Mb (Figure ). At the extremes for gap frequency are: (1) at the low end, black lemur and galago, with an average of 4 and 10 gaps per Mb, respectively; and (2) at the high end, platypus and torafugu, with an average of 42 and 48 gaps per Mb, respectively. A third characteristic, median gap size (Figure ), varies dramatically--from 67 bp for tetraodon to 523 bp for horseshoe bat; platypus is again at the high end, with a median gap size of 477 bp.
An overall comparison of frequency and size of sequence gaps indicates that some species' sequences (e.g., guinea pig and galago) have below average values; the finishing of these sequences is therefore more straightforward. Chicken sequences are intermediate, with a high frequency of relatively small gaps. Finally, at the extreme of difficulty are platypus sequences, which have substantially more gaps of larger size (compared to the overall averages).
Characteristics of sequences in gaps
The characteristics of the sequence within and immediately adjacent to a gap greatly influence the difficulty associated with filling that gap during sequence finishing. We thus examined the GC and repeat content of the sequence residing within gaps in the first data set, comparing those values to the overall averages for the total sequence generated for each species. The results for the 19 species with the greatest differences between gap sequence and total sequence are shown in Figure .
For many species, the GC content of the gap sequence is only marginally different from the GC content of the total sequence (Figure ). For some species where large differences are noted (e.g., guinea pig, marmoset, and chicken), we have not found human-grade sequence finishing to be particularly difficult; the application of standard finishing strategies involving alternate sequencing reaction chemistry (e.g., 80:20 BigDye:dGTP BigDye or dGTP BigDye supplemented with SeqRxA) generally permits the generation of the missing sequence with minimal effort. However, this is not the case for other species. For example, the gaps in dog sequence average 57% GC content, considerably higher than the 40% GC content of the total sequence. One particular dog BAC sequence [GenBank:AC090445] contains a stretch of 2,100 bp with a GC content of 83%; an embedded gap of ~ 250 bp (85% GC content overall) was refractory to most routine gap-filling efforts and ultimately required an extraordinary finishing effort to close. Several other dog BAC sequences contain at least one gap that presented similar challenges during finishing. Interestingly, such high GC-content regions in dog sequence gaps often reside near CpG islands.
Previously, we found that sequence within gaps is disproportionately more repetitive than non-gap sequence; specifically, total repetitive sequences identified by the program RepeatMasker [
29] account for roughly 50% of gap bases in comparative-grade finished sequence, while the subset of simple repeat sequences (predominantly homopolymer, as well as di-, tri, and tetranucleotide repeats) account for 3.7% of gap bases [
5]. Further, it is well-established that these simple repeat sequences present major challenges during sequence finishing (e.g., determining the exact length of a dinucleotide repeat). We thus compared the simple repeat content of gap sequences for the species in the first data set (Figure ). In aggregate for all species, 2% of the total sequence corresponds to simple repeats, whereas 10% of the sequence in captured gaps corresponds to simple repeats. Thus, simple repeats are generally enriched within captured gaps. The considerable variability seen among species with respect to this enrichment (Figure ) is even more dramatic. At the extreme, dog has an unusually high amount of simple repeats in its gap sequence (28%); by comparison, in 26 of 37 other species, this value is less than 10%. Table details the repeat content within all the sequences of our first data set (taken as a whole) as well as that in captured and uncaptured gaps. In addition to again noting the significant enrichment of simple repeats within captured gaps, these data reveal that long interspersed repeats (LINES) are significantly enriched in uncaptured gaps.
| Table 4Repeat content of total sequences and gap sequences. |
Captured versus uncaptured gaps
An early step in sequence finishing involves ordering and orienting contigs within the initial sequence assembly. This process extensively utilizes information about the paired forward- and reverse-primed sequence reads generated from each plasmid template during the shotgun phase. Two contig ends are considered adjacent to each other if two or more sequence reads in one contig connect to their plasmid 'mates' in the other contig; in such cases, the intervening gap is considered 'captured' by the spanning subclones. Furthermore, the calculated minimum distance between read pairs should not greatly exceed the average insert size of the subclone library. Sequence gaps not spanned by at least two appropriately spaced sequence read-pairs are deemed 'uncaptured' and add no spatial information to the sequence contig map. An assembly containing several uncaptured gaps can thus be quite challenging to finish, with other data required to deduce contig order and orientation; substantial additional effort is frequently involved during the finishing process, particularly in cases where the auxiliary data are weak or limited. Furthermore, generating the sequence of regions within captured gaps (e.g., during human-grade sequence finishing) is most readily achieved by sequencing the spanning plasmid subclones; in contrast, generating the sequence of regions within uncaptured gaps requires sequencing alternative templates (e.g., PCR products or purified BAC DNA), which is more costly and labor-intensive.
We analyzed the comparative-grade finished sequence generated for the 2,031 BACs in the second data set for the presence of captured and uncaptured gaps. In aggregate, these sequences contain roughly 30 gaps per Mb, with approximately two-thirds of these gaps being captured. This result is generally expected for shotgun sequence assemblies with at least eightfold sequence redundancy. However, the characteristics of the sequence gaps vary among genomic regions and among species (Figure ). For example, the average total number of gaps within the sequences generated for these 14 ENCODE pilot project regions ranges from 20 per Mb (ENr213) to 43 per Mb (ENm003), while the fraction of uncaptured gaps varies from 26% (ENm010) to 50% (ENr312) (Figure ). None of these are among the worst genomic regions with respect to gaps that we have encountered; the sequences generated for ENCODE pilot project regions ENr231, ENr232, and ENr333 (included in Figure for comparison purposes, but not actually part of the second data set) contain more gaps per Mb (56, 68, and 48, respectively) than any of the other analyzed regions. Similar differences can be seen among the sequences from 21 species in the second data set, with the average total number of gaps ranging from 19 per Mb (galago) to 58 per Mb (platypus), while the fraction of uncaptured gaps varies from 21% (hedgehog) to 50% (elephant and tenrec) (Figure ).
The effort required for establishing the order and orientation of sequence contigs--a key part of the sequence-finishing process--directly relates to the number of uncaptured gaps in the sequence assembly. While total gaps per Mb is an important metric to monitor, in practice, it is the presence of two or more uncaptured gaps in a given BAC sequence assembly that presents a significant challenge during sequence finishing. Within the second data set, the fraction of BAC sequence assemblies containing two or more uncaptured gaps ranged from 16% (ENr213) to 47% (ENm003) among the different ENCODE pilot project regions and 12% (vervet) to 60% (platypus) among the different species. BAC sequences from ENCODE regions ENm003, ENm005, ENr111, ENr211 and ENr312 (as well as ENr333 and ENr231) have routinely required extra finishing effort, mostly because over 40% of the sequence assemblies contained two or more uncaptured gaps. Meanwhile, in the case of vervet and squirrel BACs, nearly all sequence assemblies contained fully ordered and oriented contigs, since few of these BAC sequences had two or more uncaptured gaps. In contrast, platypus BAC sequence assemblies had an average of over eight gaps per BAC, with more than 44% of those gaps being uncaptured; this characteristic (in addition to other features of platypus sequence mentioned above) has made the finishing of platypus sequence more challenging than that of any other species' sequence encountered to date.
Strategies for reducing sequence gaps
In recent years, we have implemented various steps to reduce the number of gaps in our BAC sequence assemblies. For example, in the case of sequence assemblies containing two or more uncaptured gaps, we often will construct fosmid shotgun libraries from the starting BAC DNA, and then generate a modest number (e.g., 96) of sequence read-pairs from fosmid insert-ends. The resulting fosmid clones often capture the DNA residing in previously uncaptured gaps, thereby helping to order and orient the sequence contigs. While effective, such a fosmid-based approach is labor-intensive and expensive; it also introduces delays in finishing a BAC sequence because of the manual nature of fosmid-library construction and subclone-DNA purification. Of note, this approach has not been particularly successful for some species (e.g., platypus). In these cases, the sequence assemblies frequently contain too many contigs that cannot be unequivocally ordered even with fosmid-derived read-pairs (in part because the fosmid insert is too large, with the resulting insert-end reads simply skipping over many contigs and gaps). As an alternative, we found that generating read pairs with a 10- kb insert plasmid library was more productive.
Another approach that we have implemented involves the use of an alternate bacterial host strain for constructing the initial shotgun plasmid library. We reasoned that uncaptured gaps might reflect regions of foreign DNA that are 'poisonous' to the bacterial host, with subclones propagating such DNA in high copy-number plasmids either failing to grow or growing poorly [
30-
32]. Thus, a host that constrained plasmid copy number might reduce any bacterial growth inhibition, yielding a more uniform distribution of sequence reads across the starting template.
To test this idea, we selected three previously generated BAC sequences (produced using our standard bacterial host: E. coli DH10B tonA). Aliquots of the ligation reactions, which earlier produced the initial shotgun libraries, were used to separately transform competent cells of copy-control E. coli strain EPI400 (Epicentre Biotechnologies); BAC sequencing was then performed per our usual routine, after which we compared assemblies from sequence reads obtained with each host strain. Representative results from those comparisons are shown in Figure for a portion (~ 20 to 60 kb) of each of the three BACs. Note the greater uniformity in sequence-read redundancy obtained with the copy-control host strain and the elimination of gaps in two of the BAC sequences.
Based on these initial findings, we performed a similar study with eight additional BACs. Using our standard host strain, the sequence assembly of each of these clones had ≥ 5 uncaptured gaps (Table ). An additional two BACs [GenBank:AC190000 and GenBank:AC188356] (see Table ) were included as controls, with their sequence assemblies having 2 and 0 uncaptured gaps, respectively. In most cases, the sequence assemblies generated using the copy-control host cells contained a greatly reduced total number of gaps (especially uncaptured gaps; see Table ). In several cases, the total number of uncaptured gaps was reduced to <2, thereby eliminating the challenge of ordering and orienting sequence contigs. Interestingly, a significant reduction in the number of gaps per BAC is associated with decreased variation in sequence-read redundancy; we now use this correlation to predict whether a particular BAC sequence assembly with many gaps would benefit from the extra effort of preparing a supplementary shotgun library with copy-control competent cells.
| Table 5BAC sequence assemblies generated using standard versus copy-control bacterial host cells. |
Several points should be made about using such copy-control host strains for shotgun-library construction. First, because there is some reduction in the yields of purified plasmid DNA (and, therefore, sequencing success rates and read lengths) associated with the use of EPI400 cells, we only utilize copy-control host strains in special cases. For example, for species (e.g., platypus) whose initial sequence assemblies regularly have large numbers of gaps, we now routinely use copy-control EPI400 cells during the initial shotgun phase; for other species, these cells are used only after the BAC sequence assembly is reviewed and deemed to require such an approach {with the key determinant being a large variation in sequence-read redundancy (as seen in Figure ) rather than simply the total number of uncaptured gaps}. Finally, implementing the use of a copy-control host strain is straightforward in a high-throughput sequencing pipeline, making it a convenient and practical option on an as-needed basis. In practice, competent cells derived from the copy-control host strain are simply transformed by the original ligation reaction, eliminating the need to construct an entirely new library in an alternate cloning vector.
Utilization of 'next-generation' sequencing technologies
Recently, several 'next-generation' DNA sequencing technologies have become commercially available [
33-
35]. These involve the use of completely different methods for generating shotgun-sequence data, with the resulting sequence reads being considerably shorter than those generated by Sanger-based chemistries (but at a fraction of the per-base cost). We investigated whether data produced with one of these new technologies without cloning bias might complement our standard Sanger-based shotgun sequence reads in a fashion that reduced the problems encountered during sequence finishing.
We selected 12 BACs for this study: 10 whose sequence was previously found to be problematic at the sequence-finishing stage and 2 whose sequence was straightforward to finish (these BACs were derived from 8 different species and 7 different genomic regions). The purified DNA from each BAC was sheared and then ligated with a unique 'indexing' linker. All 12 samples were then mixed together, and the resulting pool then used to make a single shotgun sequencing library using the Genome Analyzer instrument (Illumina) [
33]. The resulting paired sequence reads (each being 35 bases in length) were sorted by BAC using the respective indices, with each set of reads then assembled using Velvet [
36]; the resulting sequence contigs for each BAC were then assembled with the original Sanger-based shotgun reads using Phrap. Preliminary results showed that the Velvet-derived sequence assemblies differed among the 12 BACs; for example, contig 'N50' size varied from 0.2 to 7 kb, with the 'largest contig' per BAC ranging from 2.8 to 24 kb. Of the contig sequences (totalling 2.35 Mb) that aligned to the Sanger-derived consensus sequences, there were few, if any, mismatches; thus, the quality of the Genome Analyzer-derived assembled sequence was extremely high.
Combining the Sanger- and Genome Analyzer-derived shotgun sequence data simplified some of the sequence-finishing process, but did not solve the most vexing finishing problems. For example, in some cases, the use of Velvet-assembled contigs helped to extend sequence from contig ends, provided complementary strand data, and resolved miscellaneous ambiguities (thereby eliminating the need to generate additional sequence reads during the finishing process). In fact, adding Genome Analyzer-derived data to the Sanger-derived assemblies reduced the total number of contigs for the 10 problematic BACs; for five BACs, a few of the uncaptured gaps were closed. However, new problems sometimes arose with the addition of Velvet-assembled contigs; for example, in several instances, new misassemblies formed where none existed in the Sanger-derived assembly. Most disappointing was the fact that many troublesome gaps in the Sanger-derived assemblies, which had been extremely problematic (and expensive) to close during human-grade sequence finishing, remained unchanged following the incorporation of Velvet-assembled contigs. Although some small contigs were found to partially fill these gaps, much of the missing sequence had no matching Genome Analyzer-derived sequence. These results point out the biases and limitations of both sequencing methods that utilize DNA polymerase.