Several genome centres have established very effective "pipelines" for sequencing entire genomes, often using a mixed strategy that draws data from mapped clones and whole-genome shotgun efforts [1]. However, such pipelines are rapidly evolving through the incorporation of new, high-throughput sequencing platforms that obviate conventional clone libraries or Sanger sequencing chemistries [2] but yield sequence reads of limited length. As such, current and emerging sequencing strategies benefit from the availability of compatible physical maps that guide clone selection or facilitate assemblies, and that may also enhance validation efforts [1]. Furthermore, physical maps become a critical necessity when spanning repeat-rich genomic regions, typified by telomeric or centromeric portions of chromosomes. Although clone fingerprint [5] or end-sequence maps [6] are widely used in whole-genome sequencing efforts, gaps and sequence assembly errors may persist, stemming from "clone drop-outs" or from uncertainties in the map assembly process caused by repeated sequence elements. Clearly, physical maps will continue to be an important feature of large-scale sequencing projects, but new mapping approaches must advance in ways that effectively deal with the trend towards obviation of traditional libraries and the abundance of modestly sized reads. The main issues will likely center on the validation of "strategically" unfinished genome sequences and the comprehensive description of genome structure. These issues become acute when the genomes selected for sequencing belong to newly described organisms that lack genetic resources or an associated scientific community. Consequently, future comparative studies could suffer from sequencing errors and fail to fully discern structural variation, a major feature of genome evolution and a source of disease genotypes.
The rice (Oryza sativa) genome was originally chosen for sequencing because rice is a staple food crop for more than half of the world's population, and because of its many genetic attributes and resources, which include: a compact genome (~400 Mb), well-defined genetic maps, Yeast Artificial Chromosome (YAC) and Bacterial Artificial Chromosome (BAC)/P1-derived Artificial Chromosome (PAC) map resources, comprehensive sequence-tagged or transcript maps, and efficient genetic transformation techniques [7]. Rice also shares extensive syntenic relationships with other cereal plants bearing huge genomes, such as maize (~2,500 Mb), barley (~4,900 Mb), and wheat (~15,000 Mb) [21]. The large-scale sequencing of rice (O. sativa cv Nipponbare) was initiated in 1998 under the auspices of the International Rice Genome Sequencing Project (IRGSP), with joint efforts from Japan, the United States, China, Brazil, Great Britain, France, India, Korea, and Thailand [27]. IRGSP members decided at the time to pursue the collaboration-friendly, clone-by-clone (BAC/PAC-by-BAC/PAC) strategy supported by extensive map resources. For example, BAC/PAC draft sequences or contigs were anchored and oriented on the rice genetic maps; these contigs were further augmented by BAC-end sequencing (via Sequence-Tagged Connectors) and contig-end walking. BAC maps and fibre fluorescence in situ hybridization (FISH) were also used to characterize gaps within low-recombination regions or genomic portions showing modest BAC/PAC coverage [17]. The IRGSP release of the rice genome is now finished, with publication of the analysis and annotation of these data [29]. Here, the IRGSP sequence for each chromosome (Build 4.0, released in August 2005) is represented as a "pseudomolecule," or virtual contig. Each pseudomolecule is constructed by joining PAC/BAC sequences in the order determined by comparison with a previously constructed physical map [30]. Finishing steps include identification and removal of overlapping sequences, with the resulting physical gaps replaced by a variable number of successive "N's" reflecting their estimated breadth. There are 62 physical gaps, including 17 telomeric gaps and 9 centromeric gaps with a total size of ~18.1 Mb [30]; within the current build, one gap in chromosome 1 has been closed and some of the others have been partially filled [36].
In a parallel effort, TIGR (The Institute for Genomic Research) also constructed similar pseudomolecules for each of the 12 rice chromosomes [37] to enlarge the span of regions comprising blocks of contiguous sequence. Their approach included the resolution of discrepancies between overlapping BAC/PAC clones, trimming of overlap regions, and linking of unique sequences. These efforts relied on 3,450 rice BAC/PAC clone sequences obtained from the IRGSP; of these, 3,408 BAC/PAC clones (98.8%) were finished and 42 BAC/PAC clones (1.2%) were unfinished (phase 2), as defined by GenBank. As such, the data show many gaps between clones, i.e., physical gaps, denoted by "1000N's" in the final pseudomolecules. Finally, centromeres were identified using the "CentO" centromeric sequence [38]. There are 48 physical gaps within the 12 pseudomolecules, including gaps at 10 centromeres.
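Because physical gaps in such pseudomolecules are encoded as runs of "N" characters, their positions can be recovered directly from the sequence text. The following is a minimal sketch of that idea, not a procedure taken from the IRGSP or TIGR pipelines; the function name `find_gaps`, the `min_run` parameter, and the toy sequence are illustrative assumptions.

```python
import re

def find_gaps(seq, min_run=1):
    """Return (start, end, length) for each run of N's in seq.

    Coordinates are 0-based and half-open. min_run is a hypothetical
    threshold for ignoring short runs of ambiguous bases.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", seq):
        length = match.end() - match.start()
        if length >= min_run:
            gaps.append((match.start(), match.end(), length))
    return gaps

# Toy pseudomolecule (assumed data): one 1000-N gap placeholder and
# one short run of ambiguous bases.
toy = "ACGT" + "N" * 1000 + "GGCC" + "N" * 5 + "TTAA"
placeholder_gaps = find_gaps(toy, min_run=1000)
all_runs = find_gaps(toy)
print(placeholder_gaps)
print(all_runs)
```

Setting `min_run=1000` keeps only runs at least as long as the fixed-size placeholders described above. A placeholder of fixed length marks where a gap lies but says nothing about its true size.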
In addition to the rice genome sequence from the IRGSP and TIGR, draft genome sequences of the same cultivar, Nipponbare, were generated by two separate private sources: Pharmacia, Inc. (previously part of Monsanto, Inc., Peapack, NJ) and Syngenta, Inc. (San Diego, CA) [39]. A 259 Mb draft sequence from Pharmacia was also generated by a clone-by-clone strategy [40], while the 390 Mb Syngenta draft sequence, consisting of 42,109 sequence contigs, was obtained using a whole-genome shotgun sequencing approach, with an estimate of 32,000 to 50,000 genes for this cultivar. A draft sequence (361 Mb out of the estimated genome size of 466 Mb) of the O. sativa L. ssp. indica cultivar (93-11) was also obtained by the Beijing Genomics Institute (BGI) [41] using a whole-genome shotgun sequencing approach, with an estimated gene count of 46,022 to 55,615 for this subspecies. Using TIGR's pseudomolecules and publicly available rice EST sequence data, Affymetrix has recently constructed a rice gene expression array with 46,115 rice gene models (Affymetrix, personal communication). In this regard, rice has the largest number of predicted genes amongst sequenced genomes, with more genes than human and almost double the number in Arabidopsis thaliana.
Although the current releases of rice sequence from the IRGSP and TIGR are of very high quality, difficult gaps remain to be spanned in each of the 12 chromosomal pseudomolecules. These gaps persist because some reside within genomic regions with sparse coverage of the genetic markers used for anchoring BAC/PAC clones, while others reflect library construction, which may bias against heavily repetitive regions. Importantly, existing gaps probably contain many functional genes [36], even within centromeric regions [42], in addition to information describing chromosomal structure. Furthermore, sequence contigs within the pseudomolecules may still contain errors, arising in part from assemblies conducted in repeat-rich regions of the rice genome. Given these issues, we constructed a genome-wide optical map to determine the size of sequence gaps and to identify problematic sequence assemblies (discordances) through the analysis of sequence build alignments against our optical restriction map.
Optical mapping is now a robust, automated system for the construction of whole-genome ordered restriction maps from ensembles of individual genomic DNA molecules [43]. Library construction, PCR amplification, hybridization, and their attending artefacts are obviated in optical mapping, since genomic DNA is the analyte and restriction enzymes are used to generate reliable markers. Therefore, high-resolution physical maps are created on a whole-genome basis, presenting an organism's genetic constitution in a form directly linkable to sequence data. Using a newly automated version of the optical mapping system, we have constructed a whole-genome physical map of the rice genome, employing schemes akin to whole-genome shotgun sequencing approaches. Given the results presented here, we find that there is no practical limit to the optical mapping of large, complex genomes, even in the absence of any sequence information, since the level of automation we have achieved provides ample data sets for our assembly techniques to span entire genomes. However, accessible sequence data for any genome provide direct links to a multitude of annotation and analysis tools, which are especially valuable when genomes, such as maize, are chosen to be strategically sequenced.
More specifically, we present the construction of a whole-genome shotgun optical restriction map of the rice (Oryza sativa) genome, using SwaI, and its comparison to sequence builds. Because physical distances (in kb) between restriction sites are accurately determined by optical mapping, alignments of an "in silico" restriction map, constructed from sequence data, against our optical map reveal discordances characterized by features commonly associated with ordered restriction maps. These include missing or extra restriction "cuts," missing or extra restriction fragments, and significant alterations of restriction fragment sizes or patterns. Large-scale discordances covering hundreds of kilobases are also discoverable and are described here.
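To make the notion of an "in silico" restriction map concrete, the sketch below digests a sequence at SwaI sites (recognition sequence ATTTAAAT, cut at ATTT^AAAT) and compares the resulting fragment sizes with a second list of sizes. It is an illustration under stated assumptions, not the alignment method used in this work: the function names, the 10% tolerance, and the toy data are hypothetical, and the comparison assumes both maps already contain the same number of fragments, so it cannot detect missing or extra cuts.

```python
SWAI_SITE = "ATTTAAAT"   # SwaI recognition sequence (palindromic)
SWAI_CUT_OFFSET = 4      # blunt cut between ATTT and AAAT

def in_silico_digest(seq, site=SWAI_SITE, cut_offset=SWAI_CUT_OFFSET):
    """Return the ordered fragment lengths (bp) of a complete digest."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)  # step by 1 to catch overlapping sites
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

def size_discordances(in_silico, optical, rel_tol=0.10):
    """Naive pairwise size check (hypothetical helper).

    Returns (index, in_silico_size, optical_size) for fragments whose
    sizes differ by more than rel_tol relative to the optical size.
    """
    if len(in_silico) != len(optical):
        raise ValueError("maps must have the same number of fragments")
    return [
        (i, s, o)
        for i, (s, o) in enumerate(zip(in_silico, optical))
        if abs(s - o) > rel_tol * o
    ]

# Toy sequence (assumed data) with two SwaI sites.
toy = "C" * 10 + SWAI_SITE + "G" * 20 + SWAI_SITE + "C" * 5
fragments = in_silico_digest(toy)
flagged = size_discordances(fragments, [14, 35, 9])
print(fragments)
print(flagged)
```

Because the SwaI site is its own reverse complement, scanning one strand is sufficient. A realistic comparison would need an alignment that tolerates missing cuts, extra cuts, and sizing error, which this sketch deliberately leaves out.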
As such, we show that a high-resolution physical map based on the direct analysis of genomic DNA spans existing sequence physical gaps, validates the genome sequence assembly, characterizes gaps, corrects sequence misassemblies, and creates a physical scaffold for sequence finishing. We expect that this map will also serve as a resource for the genome sequencing community at large in its investigation of rice subspecies and cultivars. We also think that the maps presented here will facilitate the final validation of the rice sequence data, which should strengthen the important role that this genome is already playing as an accessible model system for other plants and cereal crops.