Barley (Hordeum vulgare L.) is among the four most important cereal crops worldwide [1]. Despite its agronomic importance, however, efficient gene isolation and genome-wide studies of genetic diversity are hampered by the lack of a reference genome sequence. Such a reference would resolve barley's genetic makeup and would serve as the essential basis for elucidating the mechanisms underlying phenotypes and traits as well as the processes of plant adaptation and improvement.
Genome size (~5 Gb) and the high content of repetitive DNA elements (>80%) are the major obstacles to sequencing the entire barley genome [2,3]. Whereas a medium-sized plant genome like rice (~400 Mb) could be sequenced by Sanger sequencing [4] for a budget of over 100 million USD (T. Sasaki, personal communication), the same endeavor was not affordable for barley (for review see [5]). Here, the massively parallel or "next generation sequencing" (NGS) technologies, currently represented by the 454/Roche, Solexa/Illumina and SOLID/ABI platforms, promise to change the situation, since several gigabases (Gb) of sequence data can be accumulated in a few weeks for only a fraction of the cost of Sanger sequencing (for review see [6-8]). NGS technology has been successfully applied to de novo sequencing and re-sequencing of entire prokaryotic genomes [9] and to re-sequencing of higher eukaryotes including humans [10-13]. Recently, similar efforts were made in plants, using the Solexa/Illumina platform for re-sequencing of Arabidopsis thaliana [14] and a mixed Sanger and 454/Roche sequencing strategy for grapevine (Vitis vinifera) [15]. Whereas the relatively short read lengths of the Solexa/Illumina (GAI/II) and ABI (SOLID) platforms (35-75 and 30-50 bp, respectively) may not yet suffice to sequence efficiently across long stretches of repetitive DNA in barley, the 454/Roche system generates average read lengths of ~250 bp (GS FLX) and ~400 bp (GS FLX Titanium), which are potentially more appropriate for de novo sequencing of complex genomes. However, it remains to be proven whether this holds true given the extraordinarily high content of repetitive DNA elements in the barley genome, which often form blocks extending over several hundred kb [16].
Independently of the platform, two different sequencing strategies are widely used. Whole genome shotgun (WGS) sequencing is based on random shearing of whole genomic DNA and is preferentially applied to medium-sized genomes with limited amounts of repetitive DNA. For plant genomes, WGS by NGS has so far been restricted to re-sequencing where a reference sequence was available (i.e. Arabidopsis thaliana [14]) and to de novo sequencing (with or without NGS) of small and medium-sized genomes like strawberry (<200 Mb per haploid genome) [17,18] and Sorghum bicolor (~730 Mb) [19], or with support of non-NGS data (grapevine) [15].
The second, hierarchical shotgun (HS) approach is based on sequencing bacterial artificial chromosomes (BACs) anchored to a physical map ("clone-by-clone" sequencing). This strategy is more costly than WGS but in return is suitable for generating high-quality reference sequences even for highly repetitive genomes [5]. The map-based strategy was applied not only to sequencing the human genome but also to plant genomes such as Arabidopsis [20], rice [21] and maize [22]. Due to its accuracy and reliability, the "clone-by-clone" strategy was also favored for producing a high-quality reference sequence of the barley genome [2,23].
Previously, it was demonstrated that genes could be assembled into contigs when barley BACs were sequenced with the short reads of ~100 bp provided by the earlier 454/Roche platform (GS20) at a sequence coverage of ~10- to 20-fold [24]. Similar results were obtained by sequencing BAC clones of salmon (Salmo salar) using the GS FLX (~250 bp read length); however, the potential of the method to yield high-quality BAC clone sequences was called into question [25].
Based on these initial studies, the 454/Roche platform can be considered a robust platform for assembling genes from genomic sequences, given sufficient sequence coverage. However, at a sequencing capacity of up to 500 Mb per single GS FLX Titanium run, sequencing individual BAC clones would be rather uneconomical, and efficient use of the technology would require multiplexing of individual samples. Recently, pools of 28 BAC clones of wild rice Oryza barthii, selected from fingerprinted contigs, were sequenced by the 454 technology and assembled into superscaffolds by mapping to the O. sativa rice reference genome [26]. Because barley lacks a reference genome, this BAC pool sequencing approach is not yet feasible for barley, and multiplex sequencing would require a reliable tagging (barcoding) strategy to relate each sequence read to its BAC clone of origin. Barcodes are specific short sequence tags that can be introduced either before the 454 sequencing library preparation [27] or by ligation of individual adaptors ("MID" = "Multiplex Identifier", Roche Diagnostics) to fragmented BAC DNAs prior to sequencing in pools.
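The principle of relating pooled reads to their clone of origin can be sketched as follows. This is a simplified illustration and not the procedure used in this study or by the Roche software: it assumes that each read begins with its barcode, that matching is exact, and that all barcodes have the same length. The barcode sequences and clone names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign reads to clones by an exact barcode match at the read start.

    reads    -- iterable of (read_id, sequence) tuples
    barcodes -- dict mapping barcode sequence -> clone name
    Returns (bins, unassigned): bins maps clone name -> list of
    (read_id, sequence with the barcode trimmed off); unassigned holds
    the reads that match no barcode.
    """
    bins = defaultdict(list)
    unassigned = []
    for read_id, seq in reads:
        for tag, clone in barcodes.items():
            if seq.startswith(tag):
                bins[clone].append((read_id, seq[len(tag):]))
                break
        else:
            # No barcode matched this read.
            unassigned.append((read_id, seq))
    return dict(bins), unassigned


if __name__ == "__main__":
    # Hypothetical barcodes and reads, for illustration only.
    tags = {"ACGTACGTAC": "BAC_A", "TGCATGCATG": "BAC_B"}
    pooled = [
        ("r1", "ACGTACGTACGGGTTT"),
        ("r2", "TGCATGCATGAAAC"),
        ("r3", "NNNNGGG"),
    ]
    bins, unassigned = demultiplex(pooled, tags)
    print(bins)
    print(unassigned)
```

Exact prefix matching discards any read with a sequencing error inside the barcode; a practical implementation would typically tolerate a small number of mismatches, which in turn requires barcodes that differ from one another at several positions.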
Here, as a proof of concept for a new strategic component of sequencing a large, complex and highly repetitive crop plant genome in a clone-by-clone approach, we report the pool sequencing of 91 barcoded, randomly selected, gene-containing barley BACs by the 454 technology. Furthermore, we present the assembly of the sequence data under variable parameters and evaluate the resulting assemblies for their consistency and reliability.