The nuclear genomes of grass species vary widely in size due to polyploidization and amplification of repeat elements. On the smaller end of the size spectrum lies rice (Oryza sativa), whose ~390 Mb genome has been sequenced [1]. Mid-sized genomes such as that of maize (Zea mays, 2.4 Gb) present a far greater challenge for sequencing, while large genomes such as that of bread wheat (Triticum aestivum, 17 Gb) will require exceptional approaches. The great differences in genome size are mainly caused by differences in repetitive DNA content, primarily LTR (long terminal repeat) retrotransposons that can comprise more than 50% of a nuclear genome. Because the amplification of these sequences occurred in waves, assembly of contiguous sequence information without a physical map is difficult or impossible for genomes as large as the maize genome [2].
Despite these great variations in genome size, the gene content of these species appears to be about the same per 2N genome [3]. Therefore, a number of "gene-enriched" sequencing techniques have been proposed with the goal of sequencing primarily the genic regions, while excluding repetitive sequence as much as possible. The oldest of these is EST sequencing, consisting of end reads from cDNA clones. More recently, full-length cDNAs have been sequenced in rice [4] and Arabidopsis [5], and a similar project is underway in maize [6]. These are by far the most gene-enriched techniques, albeit with some contamination from transcribed transposable elements (TEs) and other repeats [7]. In addition, these transcription-based techniques automatically eliminate transcriptional pseudogenes, a non-trivial undertaking when analyzing genomic sequence alone. However, EST sequencing also has significant drawbacks, of which the most important is that it is strongly affected by transcriptional biases. Some genes are expressed at high levels per cell or tissue, while others make only a handful of mRNA molecules, and/or are active only in certain tissues or developmental phases. Hence, EST sequencing tends to miss some genes while sequencing others many times over, a costly redundancy. Furthermore, sequencing of expressed products captures only exons and excludes all information about promoter regions, and paralogs may be easily conflated because intronic sequence data are missing.
To overcome these drawbacks, other techniques have been developed that work directly with nuclear DNA. The most generic is high-CoT sequencing (HC), in which sheared DNA is separated based on renaturation time. Repetitive fragments renature faster, allowing them to be removed. This method has been applied in maize and resulted in 6-fold enrichment of genic sequences [8]. A second approach, referred to as methyl-filtration (MF), makes use of the special properties of methylation in higher plant genomes. In these genomes, repetitive DNA is generally found to be hypermethylated, while genic regions are hypomethylated, permitting enrichment by cloning into bacteria that do not tolerate some forms of cytosine-methylated DNA [10]. For the MF technique, total genomic DNA is sheared and inserted into a plasmid vector, followed by transformation of the library into an Escherichia coli host that will not tolerate clones containing methylated DNA inserts; in maize this technique yielded gene enrichment comparable to the HC reads [8]. A third approach employed in maize makes use of the RescueMu transposable element [16], relying on the fact that Mutator elements preferentially insert into low-copy-number DNA [17].
A large dataset of HC, MF, and unfiltered (UF) sequences has been produced and assembled to generate the Assembled Zea mays (AZM) contigs [8]. Version 4 of these assemblies was constructed from 450,166 MF, 445,565 HC and 50,866 UF reads. This dataset contains separate MF-only, HC-only and UF-only assemblies, which are henceforth referred to as the "MF", "HC", and "UF" datasets. The RescueMu assemblies [19], henceforth referred to as "RM", are also available for comparison.
The MF and HC datasets were derived from small-insert clones, with a mean length of 2 kb for HC clone inserts and <1.5 kb for MF clone inserts. Small inserts were essential for these techniques, since longer clones will often include repetitive elements along with genes and thereby be excluded; however, the small sizes limit the ability of the read pairs to link contigs into scaffolds. RescueMu-flanking sequences also do not link over a substantial distance, as the reads are adjacent to the insertion site. Assembly of such reads leads to short contigs that result in a gapped alignment along a gene, especially because transposable elements may be present within introns [20] and between core promoters and regulatory elements [21]. Although comparison with EST databases and sequenced clones indicates that almost every gene is represented in the assemblies, only about 30% were fully covered by the first ~450,000 MF and HC reads [8]. When these reads were compared to 100 random genomic regions of the maize genome, only 29% of the predicted genes were covered over more than 90% of their length [20]. Furthermore, sequences assembled from these gene-enriched sets cannot be easily localized on a physical or genetic map.
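The coverage statistic quoted above (the share of genes covered over more than 90% of their length) can be illustrated with a minimal sketch. It assumes that contig alignments are already available as half-open intervals in gene coordinates; the function names, the data layout, and the example genes are hypothetical and are not taken from the cited analyses.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in gene coordinates (assumed layout)


def covered_fraction(gene_length: int, alignments: List[Interval]) -> float:
    """Fraction of a gene's length covered by the union of aligned intervals."""
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(alignments):
        start = max(start, current_end, 0)  # skip bases that were already counted
        end = min(end, gene_length)         # clip alignments running past the gene
        if end > start:
            covered += end - start
            current_end = end
    return covered / gene_length


def share_well_covered(genes: Dict[str, Tuple[int, List[Interval]]],
                       threshold: float = 0.9) -> float:
    """Share of genes whose covered fraction exceeds the threshold."""
    hits = sum(
        1 for length, alignments in genes.values()
        if covered_fraction(length, alignments) > threshold
    )
    return hits / len(genes)


# Hypothetical example: two 1,000-bp genes with gapped contig alignments.
example_genes = {
    "gene_a": (1000, [(0, 500), (400, 950)]),  # overlapping contigs, 95% covered
    "gene_b": (1000, [(0, 300), (600, 800)]),  # intron-sized gaps, 50% covered
}
```

In this toy input only `gene_a` passes the 90% threshold, so `share_well_covered(example_genes)` evaluates to 0.5. Merging overlapping intervals before summing keeps redundant reads from inflating the apparent coverage.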
To overcome these limitations, two complementary techniques have been proposed. Both of these approaches generate paired end reads from longer clones, allowing them to link contigs generated by the previous methods. Both techniques rely on methylation-sensitive restriction enzymes to cleave nuclear DNA preferentially in genic regions. For methylation-spanning linker libraries (MSLL, [24]), DNA was subjected to complete digestion by restriction enzymes such as SalI or HpaII, and fragments of varying sizes (from 7 kb to >100 kb) were cloned. The relatively long length of these clones allows them to span repetitive regions between genes, thereby linking the genes; it also allows their integration into a BAC-based physical map based on DNA fingerprints. Hypomethylated partial restriction (HMPR, [25]) libraries are similar but utilize only enzymes having 4-bp recognition sequences. Partial digestion was employed, and fragments from 2–4 kb were selected for cloning and end-sequencing. The need for two unmethylated sites in close proximity ensures that these clones often sample low-copy-number sequences, and that they can also provide valuable information for linking contigs into scaffolds.
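The difference between the two digestion strategies can be sketched in silico. The following is a simplified illustration, not the laboratory protocol: it assumes that recognition sites are given as positions with a known methylation state, that a methylated site is never cut, and that only fragments bounded by two cut sites are considered. The function names, default size limits, and example positions are assumptions chosen to mirror the size ranges mentioned above.

```python
from typing import List, Tuple

Site = Tuple[int, bool]      # (position in bp, is_methylated), assumed input layout
Fragment = Tuple[int, int]   # (left cut site, right cut site)


def complete_digest(sites: List[Site], min_len: int = 7_000) -> List[Fragment]:
    """MSLL-like model: every unmethylated site is cut, so fragments run
    between consecutive unmethylated sites; keep those of at least min_len."""
    open_sites = sorted(pos for pos, methylated in sites if not methylated)
    return [(a, b) for a, b in zip(open_sites, open_sites[1:]) if b - a >= min_len]


def partial_digest(sites: List[Site], min_len: int = 2_000,
                   max_len: int = 4_000) -> List[Fragment]:
    """HMPR-like model: any two unmethylated sites may bound a fragment,
    because sites between them can remain uncut; keep the selected size range."""
    open_sites = sorted(pos for pos, methylated in sites if not methylated)
    fragments = []
    for i, a in enumerate(open_sites):
        for b in open_sites[i + 1:]:
            if b - a > max_len:
                break  # sites are sorted, so later partners are even farther away
            if b - a >= min_len:
                fragments.append((a, b))
    return fragments


# Hypothetical region: a hypomethylated cluster, a methylated block, then two open sites.
example_sites = [
    (0, False), (1_500, False), (3_000, False),
    (5_000, True), (40_000, False), (41_000, True), (43_500, False),
]
```

With these example positions, `complete_digest(example_sites)` yields the single long fragment `(3000, 40000)` that spans the methylated block, while `partial_digest(example_sites)` yields `(0, 3000)` and `(40000, 43500)`, each bounded by two nearby unmethylated sites.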
Pilot studies (in maize) of 751 MSLL sequences [24] and 2112 HMPR sequences [25] demonstrated enrichment of genes (and depletion of retroelements and other repetitive DNAs) equal to or greater than that seen with MF or HC, with HMPR producing the greatest retroelement depletion seen outside of EST libraries. Such pilot studies cannot indicate, however, when such approaches saturate (i.e., lose value due to repetition in the data generated) or how generally useful they can be across a genome that has been largely or fully sequenced.
Although the DNA composition and arrangement in most plant genomes are complex and only partially understood, the much greater epigenetic complexity of these genomes is even more mysterious. Both MSLL and HMPR technologies provide full-genome capacity for the discovery of methylated blocks [24]. Comprehensive analysis of a genome with MSLL and HMPR will uncover all of the blocks of DNA that are completely methylated, perhaps in a specific tissue or at different times in development, and these can be studied to find unusual components like methylated genes or unmethylated transposable elements. As with any genomics technology, a comprehensive and high-throughput use of MSLL and HMPR can identify and highlight important components that deserve more detailed study.
This study reports a comprehensive analysis of MSLL and HMPR sequences in maize. The observations from the pilot studies are confirmed and extended. It is shown that MSLL clones longer than 100 kb may be generated, and that MSLL clones of size 35 kb and higher can be accurately placed on a BAC-based FPC map (fingerprinted contigs, [26]). The MSLL clones are found to be particularly valuable for identifying fully methylated DNA blocks that could be discovered by no other technique, thereby allowing the identification of "genes" that are either annotation artifacts or exceptional in their epigenetic status. These valuable resources are made available to all scientists by providing their alignment to sequenced maize BACs as a web-based service [28].