Acquisition of optical map data and construction of optical maps.
The three restriction endonuclease maps presented here were created via whole-genome shotgun optical mapping (1). This mapping strategy has parallels to whole-genome shotgun sequencing: large numbers of optical maps, analogous to “sequence reads,” are assembled to cover any given locus.
The resolution of the optical maps affects the contig rate and average molecule size in the contig (Table ). For the XbaI map, a total of 405 digested molecules were imaged and processed. Of this total, 204 were included in the whole-genome map contig, giving a contig rate of 50%. This low contig rate can be explained in part by the low resolution of the optical map. A map with an average fragment size of 44.73 kb requires very large genomic DNA molecules for contig assembly, due to the number of fragments in a single-molecule map required for confidence in merging that map into a map contig. The average size of molecules in the XbaI contig was about 900 kb, whereas the average size of collected DNA molecules was 637.69 kb. In addition, the digestion rate was calculated at 76.01%, which is lower than the target digest rate of 80% or higher. While still acceptable, a digest rate of 76.01% will reduce the density of apparent restriction endonuclease sites, or markers, thus increasing the difficulty in creating a contig. However, even with the low rate of contig formation, the total mass of molecules in the contig was 184.23 Mb, which corresponds to about 42-fold coverage.
Figure shows the finished XbaI map, which resembles a classical macrorestriction endonuclease map in terms of resolution. Notably, the whole-genome contig was circularized without gaps, and a typical restriction endonuclease fragment was calculated from the average of about 30 fragments. The sizing error per single fragment, or precision, was calculated from the set of restriction endonuclease fragments used to determine each consensus fragment. The average standard deviation of fragment size about the mean was 4.31 kb. The size of the R. rubrum XbaI circular contig was 4,323.08 kb, and was calculated by summing the restriction endonuclease fragments in the XbaI consensus map. This map is the lowest-resolution optical map created to date.
FIG. 2. Whole-genome circular XbaI map of R. rubrum. The outermost circle (thick line) represents the consensus map created by Gentig from the single-molecule maps shown as arcs. The single-molecule maps were made from single DNA molecules digested with XbaI.
With an average fragment size of 31.53 kb, the NheI map has a resolution in between those of the XbaI and HindIII maps. A total of 409 molecules were collected and processed to form the NheI map. Of the total, 345, or 84% of the molecules, went into the circular contig. The average size of the collected molecules was 635.72 kb, only 2 kb smaller than the average size of molecules collected for the XbaI map. However, the average size of molecules in the contig was 731.30 kb, almost 200 kb smaller than the average size of the molecules in the XbaI contig. As the average fragment size of the NheI map is smaller, a molecule of a given size will have more restriction endonuclease fragments, and smaller molecules can be included in the contig. Thus, in comparison to the XbaI map, the increased NheI contig rate is due in part to the smaller average fragment size. In addition, the digestion rate was calculated to be 87%, which means that the patterns were more informative and accurate in this map. The mass of the molecules in the contig was 252.30 Mb; this represents approximately 60-fold coverage. The contig circularized without gaps, and about 42 fragments were used to calculate a given fragment mass in the consensus map. Based on the final NheI consensus map, the average standard deviation of the fragment size about the mean was 2.83 kb. Summing the masses of all of the restriction endonuclease fragments in the consensus map gave a total genome size of 4,223.13 kb.
Finally, 932 molecules were collected for the production of the HindIII map. With an average fragment size of 10.95 kb, this map has the highest resolution of the set of maps presented here. Of the 932 molecules, 623, or 67%, went into the final contig. The smaller average fragment size loosened the requirements for molecule size in the contig; the average size of molecules in the contig was 405.49 kb. The total mass represented by the molecules in the contig was 252.68 Mb, which corresponds to about 57-fold coverage, based on the HindIII contig size of 4,456.35 kb. Again, the size of the HindIII contig was calculated by summing the masses of the restriction endonuclease fragments in the consensus map. The digestion rate of the molecules in the contig was calculated to be 78.9%. An average of about 31 fragments was used to calculate the mass of each fragment in the consensus map. For each fragment in the consensus map, the average standard deviation about the mean was 1.16 kb.
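The contig rates, fold coverages, and average in-contig molecule sizes quoted above follow from simple ratios of the reported counts and masses. The sketch below recomputes them from the figures given in the text. It assumes that coverage is the contig mass divided by the summed consensus map size (the text states this explicitly only for the HindIII map); the names `MAPS` and `summarize` are illustrative. The results agree with the reported values to within rounding.

```python
# Values reported in the text for each optical map.
MAPS = {
    "XbaI":    {"imaged": 405, "in_contig": 204, "contig_mass_mb": 184.23, "genome_kb": 4323.08},
    "NheI":    {"imaged": 409, "in_contig": 345, "contig_mass_mb": 252.30, "genome_kb": 4223.13},
    "HindIII": {"imaged": 932, "in_contig": 623, "contig_mass_mb": 252.68, "genome_kb": 4456.35},
}


def summarize(stats):
    """Derive contig rate, fold coverage, and mean in-contig molecule size."""
    mass_kb = stats["contig_mass_mb"] * 1000.0
    return {
        # Fraction of imaged molecules that were merged into the contig.
        "contig_rate": stats["in_contig"] / stats["imaged"],
        # Assumption: coverage = total contig mass / consensus map size.
        "fold_coverage": mass_kb / stats["genome_kb"],
        # Average size of a molecule that made it into the contig.
        "mean_contig_molecule_kb": mass_kb / stats["in_contig"],
    }


if __name__ == "__main__":
    for name, stats in MAPS.items():
        s = summarize(stats)
        print(f"{name}: rate={s['contig_rate']:.1%}, "
              f"coverage={s['fold_coverage']:.1f}x, "
              f"mean molecule={s['mean_contig_molecule_kb']:.1f} kb")
```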
In all of the maps, the high coverage ensured accurate calling of restriction endonuclease sites, fragment sizing, and sizing of the entire circularized genome map. Below, the accuracy of the optical maps compared to the sequence is assessed. The false circularization probabilities for the XbaI, NheI, and HindIII maps were 0.00738, 0.00329, and 0.00440, respectively. Since the XbaI map had the lowest coverage, contig rate, and digestion rate, it is not surprising that the false circularization probability for this map is the highest. However, for all the maps, the false circularization probabilities were well below 0.05, which is considered the upper limit for confident map circularization (30). The restriction endonuclease patterns generated by the XbaI, NheI, and HindIII maps appeared random; no particular restriction endonuclease patterns or structural features were observed.
Use of optical maps in sequence assembly.
All of the optical maps were made in order to guide and verify the R. rubrum genome sequence assembly process. Near the end of the finishing effort, nine sequence contigs had been generated, ranging in size from 2,311 base pairs to 1,465,886 base pairs. Alignment of the optical maps against the DNA sequence-based maps of the sequence contigs gave three independent indications of sequence contig assembly and order. Two of the sequence contigs did not align against the optical maps. They were contig 84, the plasmid sequence contig, and contig 82c, which, with a size of 2.311 kb, was too small to align with the optical maps. Six of the seven remaining contigs aligned to both the XbaI and NheI maps. Only the HindIII map, with its higher resolution, was able to align all seven sequence contigs, including contig 83 with its small size of 80.404 kb (Fig. ).
FIG. 3. Use of optical maps in confirming assembly and order of sequence contigs. In silico maps of the sequence contigs were made and aligned against the whole-genome optical maps located in the center of each diagram. A) The high-resolution HindIII map enabled ...
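A DNA sequence-based (in silico) map of a contig can be produced by locating every recognition site of the enzyme in the sequence and recording the distances between consecutive cuts. The following is a minimal sketch of that idea, not the software used in this study. The recognition sequences and cut offsets are the standard ones for XbaI (T^CTAGA), NheI (G^CTAGC), and HindIII (A^AGCTT); the function names are illustrative. All three sites are palindromic, so scanning one strand is sufficient.

```python
import re

# Recognition sequence and top-strand cut offset within the site.
ENZYMES = {
    "XbaI": ("TCTAGA", 1),
    "NheI": ("GCTAGC", 1),
    "HindIII": ("AAGCTT", 1),
}


def cut_positions(seq, enzyme, circular=False):
    """Return sorted cut coordinates (0-based) for one enzyme."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    n = len(seq)
    # For a circular molecule, extend the scan so sites spanning the origin are found.
    scan = seq + seq[:len(site) - 1] if circular else seq
    # A lookahead pattern reports every site start, including overlapping ones.
    hits = [m.start() + offset for m in re.finditer(f"(?={site})", scan)]
    return sorted({h % n for h in hits}) if circular else hits


def in_silico_map(seq, enzyme, circular=False):
    """Return the ordered list of fragment lengths (bp) after digestion."""
    cuts = cut_positions(seq, enzyme, circular)
    if not cuts:
        return [len(seq)]
    if circular:
        frags = [b - a for a, b in zip(cuts, cuts[1:])]
        # The last fragment wraps around the origin back to the first cut.
        frags.append(len(seq) - cuts[-1] + cuts[0])
        return frags
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]
```

For example, `in_silico_map(contig_sequence, "HindIII")` would give the ordered fragment sizes of a linear contig, which can then be compared with the corresponding stretch of the consensus optical map.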
All three comparisons of optical map to sequence supported a problematic assembly of sequence contig 90. Alignment of the NheI map to the sequence contigs best illustrates this (Fig. ). Contig 87 and the rightmost eight fragments of contig 90 align to the region in red in the NheI map. Inspection shows a cleaner alignment of the region with contig 87. Removing the rightmost eight fragments from contig 90 and inverting the orientation produced a solid alignment with the gap in the NheI map that was between contig 90 and contig 85c (Fig. ). Our realization of this problematic assembly confirmed the Los Alamos finishing group's suspicions of an erroneous assembly in this region. Elsewhere in the genome, there was good agreement between sequence contigs and optical maps.
The finished sequence (GenBank accession number AAAG00000000) contained a 4.4-Mb circular chromosome (contig 94, exact size is 4,352,726 base pairs) and a 54 kb plasmid (contig 93, exact size is 54,412 base pairs). Minor differences have been found between the optical maps and finished sequence and are described below.
Assessment of optical mapping errors.
Comparisons between sequence and optical mapping data were made in order to evaluate the errors and accuracy in the XbaI, NheI, and HindIII maps (Fig. ). The relative sizing error was calculated by the alignment of optical maps with the DNA sequence-based maps made from the finished sequence (Fig. ). The error bars in Figs. , , and reflect the standard deviation about the means of the restriction endonuclease fragment sizes used in calculating the consensus map fragments. In general, a high degree of correspondence was evident between the optical map and DNA sequence-based map fragment sizes. The regression values for the trendlines are 0.9985, 0.9995, and 0.9947 for the XbaI, NheI, and HindIII maps, respectively.
FIG. 4. Comparisons of the XbaI, NheI, and HindIII optical maps to sequence data. (A, B, and C) Plots of optical map fragment sizes versus the DNA sequence-based map fragment sizes from the finished sequence for XbaI (A), NheI (B), and HindIII (C). The error ...
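The trendline comparison between optical and sequence-based fragment sizes can be illustrated with an ordinary least-squares fit. The sketch below assumes that the reported "regression values" are coefficients of determination (R²) of such a fit, which the text does not state explicitly; the function name `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Here `x` would hold the sequence-based fragment sizes and `y` the matching optical map fragment sizes; a slope near 1 and an R² near 1 indicate close agreement.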
Figures , , and are scatter plots showing the relationship between absolute relative fragment sizing error (optical map versus DNA sequence-based map) and restriction endonuclease fragment size. For the XbaI map, the average relative sizing error (see figure caption) was 6.20% for fragments smaller than 5 kb and 2.87% for fragments larger than 5 kb. For fragments smaller than 5 kb in the NheI map, the average relative sizing error was 5.87%, and it was 3.11% for fragments larger than 5 kb. Finally, for the HindIII map, the average relative sizing error was 16.70% for fragments smaller than 5 kb and 8.05% for fragments larger than 5 kb. The positive skew in the relative error plots for small fragments in Fig. is discussed in greater detail in the following section.
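The averaged relative sizing errors above can be reproduced from paired fragment sizes once each optical map fragment has been matched to its sequence-based counterpart. The sketch below assumes the error is defined as |optical − sequence| / sequence and that fragments are binned by their sequence-based size, with a fragment of exactly 5 kb placed in the larger bin; the exact definition used is given in the figure caption, which is not reproduced here. The function names are illustrative.

```python
def relative_error(optical_kb, sequence_kb):
    """Absolute sizing error as a fraction of the sequence-based size."""
    return abs(optical_kb - sequence_kb) / sequence_kb


def mean_error_by_size(pairs, threshold_kb=5.0):
    """Average relative error for fragments below and at/above the threshold.

    `pairs` is an iterable of (optical_kb, sequence_kb) tuples for matched
    fragments. Returns (mean_small, mean_large); a bin with no fragments
    yields NaN.
    """
    small, large = [], []
    for optical_kb, sequence_kb in pairs:
        err = relative_error(optical_kb, sequence_kb)
        # Assumption: bin by the sequence-based (reference) size.
        (small if sequence_kb < threshold_kb else large).append(err)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(small), mean(large)
```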
Figure shows the cumulative distribution of fragment sizes for the three optical maps. Only the XbaI map has fragments greater than 135 kb, and thus this value was chosen as an endpoint in the figure to facilitate visual comparison of the three maps' fragment size distributions. Each bar represents the cumulative percentage of consensus map fragments in 5-kb intervals. For each of the maps, the distribution is roughly exponential, as expected. One key difference between the HindIII and the lower-resolution XbaI and NheI maps is the proportion of fragments smaller than 5 kb and 10 kb. In the HindIII map, about 35% of all fragments in the consensus map are smaller than 5 kb; 72% of fragments are smaller than 10 kb, and 100% of fragments are under 40 kb (the largest fragment is 37.82 kb). These numbers are in stark contrast to those for the NheI and XbaI maps. In the NheI map, only 12% of fragments are smaller than 5 kb, and 32% of fragments are smaller than 10 kb. Similarly, in the XbaI map, only 13% of fragments are smaller than 5 kb and 25% of fragments are smaller than 10 kb. For the XbaI map, there were only two additional fragments greater than 135 kb: a 242.20-kb fragment and a 256.30-kb fragment (not shown). The increased average relative sizing error for small fragments (Fig. ) seen in the HindIII map may be due to the high proportion of fragments 2 kb or smaller, many in tandem with each other, in this high-resolution map.
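The cumulative percentages described above can be computed directly from a list of consensus map fragment sizes. The following sketch assumes 5-kb bins up to the 135-kb endpoint named in the text and counts fragments strictly smaller than each bin edge, matching the "smaller than 5 kb" phrasing; the function name and the toy input in the usage note are illustrative, not data from this study.

```python
def cumulative_percent(fragments_kb, bin_kb=5.0, max_kb=135.0):
    """Return [(bin_upper_edge_kb, percent of fragments below that edge), ...]."""
    total = len(fragments_kb)
    n_bins = int(max_kb // bin_kb)
    result = []
    for i in range(1, n_bins + 1):
        edge = bin_kb * i
        # Count fragments strictly smaller than the upper edge of this bin.
        below = sum(1 for f in fragments_kb if f < edge)
        result.append((edge, 100.0 * below / total))
    return result
```

For a toy input such as `[1, 3, 7, 12, 40]`, the first entries are 40% below 5 kb and 60% below 10 kb. Fragments above the endpoint, such as the two largest XbaI fragments, keep the final value under 100%.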
FIG. 5. Cumulative distribution of optical map fragment sizes for the three optical maps. For each of the three whole-genome R. rubrum optical maps, the percentage of fragments within the size range from 0 to 135 kb is plotted. Each bar represents the cumulative ...
Comparing optical maps to the sequence.
An assessment of the previously described errors in the context of optical map to sequence alignment is necessary for distinguishing random errors from those that may consistently point to discrepancies between optical maps and sequence. Figure shows the linearized XbaI, NheI, and HindIII alignments of the consensus optical map to the corresponding DNA sequence-based map, in order to show the exact locations of discrepancies between the sequence and the optical maps.
FIG. 6. Linear view of consensus optical maps with DNA sequence-based maps. Solid black arrows represent the locations of missing cuts in the consensus optical maps. The XbaI and NheI consensus optical maps extend to the left of the origin of the DNA sequence-based ...
The alignment of the XbaI map with the DNA sequence-based map showed that there were no false cuts (apparent in the optical map but not in the DNA sequence-based map) and 12 missing cuts (apparent in the DNA sequence-based map but not the optical map) out of a total of 100 XbaI cuts in the DNA sequence-based map. Optical maps normally do not report restriction endonuclease fragments smaller than 500 bp, and, due to the resolution of optical mapping, reporting of fragments smaller than 1 kb is incomplete (21). The XbaI map had no missing fragments over 500 bp. Out of 100 fragments, the DNA sequence-based map showed two fragments smaller than 500 bp, and two fragments between 500 bp and 1 kb.
In comparison to the DNA sequence-based map, the NheI map showed no false cuts and one missing cut out of a total of 145 cuts in the DNA sequence-based map. There were four missing fragments over 500 bp in the NheI map. The DNA sequence-based map had no fragments smaller than 500 bp, and three fragments smaller than 1 kb, out of a total of 145 fragments.
Finally, the HindIII map showed no false cuts and five missing cuts in comparison to the 684 cuts in the DNA sequence-based map. Of the 684 fragments, 664 were greater than 500 bp. Of these fragments, 125 were missing in the HindIII optical map. Fifty-eight of the missing fragment loci corresponded to DNA sequence-based fragments >500 bp and ≤1 kb, 59 to fragments >1 kb and ≤2 kb, and the remaining eight to fragments >2 kb and <3 kb.
Comparing the locations of the missing cuts and missing fragments revealed no consistent errors among the three optical maps. Thus, errors appear to be random and not associated with any major discrepancy between the sequence and the optical maps.
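The distinction between missing cuts and false cuts can be illustrated with a simple coordinate comparison. The sketch below is a deliberate simplification and not the alignment method used in this study: it assumes both maps have already been placed on a common coordinate system and treats two cuts as the same site when they lie within a hypothetical tolerance `tol_kb`. The function names are illustrative.

```python
from bisect import bisect_left


def nearest_distance(sorted_positions, x):
    """Distance from x to the closest value in a sorted list (inf if empty)."""
    i = bisect_left(sorted_positions, x)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    return min((abs(c - x) for c in candidates), default=float("inf"))


def compare_cuts(optical_cuts_kb, sequence_cuts_kb, tol_kb=2.0):
    """Return (missing_cuts, false_cuts) relative to the sequence-based map.

    missing: in the sequence-based map with no optical cut within tol_kb.
    false:   in the optical map with no sequence-based cut within tol_kb.
    """
    optical = sorted(optical_cuts_kb)
    sequence = sorted(sequence_cuts_kb)
    missing = [s for s in sequence if nearest_distance(optical, s) > tol_kb]
    false = [o for o in optical if nearest_distance(sequence, o) > tol_kb]
    return missing, false
```

Comparing the `missing` lists obtained for the three enzymes by genome coordinate is one way to check whether errors cluster at the same loci or, as reported here, appear to be random.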