The completion of plant genome sequences has marked a decisive turn in our understanding of angiosperm genome content. These sequences confirm that transposable elements (TEs) are the major genomic component of most plant species, and they have facilitated important observations about TE evolution. First, TE families proliferate episodically and at different rates. As a result, closely related lineages may diverge rapidly in both TE content and genome size, as exemplified in
Oryza and
Gossypium (
Hawkins et al. 2006;
Piegu et al. 2006). Second, within species, proliferation is counteracted by TE removal via recombination and population processes driven by natural selection (reviewed in
Tenaillon et al. 2010). The opposing forces of proliferation and removal create the potential for rapid turnover of TEs and extensive within-species variation. These forces are also intimately linked to local genomic composition (i.e., gene content, nucleotide composition, methylation status, and recombination rate), thus generating heterogeneity in TE types across genomic regions.
TEs were first discovered in maize (
McClintock 1950) and may still be best characterized in this species, particularly after the recent publication of a TE database containing exemplar sequences of 1,526 TE families and subfamilies (
Schnable et al. 2009). Overall, TEs constitute over 85% of the maize reference (B73) genome (
Schnable et al. 2009), of which the 20 most common TE families comprise ~70% (
Baucom et al. 2009). These 20 “common” families are all members of the Class 1 long terminal repeat (LTR)–retrotransposons, such as the
Gypsy and
Copia superfamilies. In the genus
Zea, amplification of LTR-retrotransposons has been particularly dramatic during the last 3 million years, leading to a doubling of genome size (
SanMiguel et al. 1998;
Brunner et al. 2005). Class 2 Miniature Inverted-repeat Transposable Elements (MITEs) are also abundant in maize (
Tikhonov et al. 1999;
Fu et al. 2001;
Wei et al. 2009), with some families represented by thousands of copies (
Zhang et al. 2000,
2001). Due to their small size (<500 bp), however, MITEs occupy much less of the genome than LTR-retrotransposons.
In addition to varying in copy numbers, individual TE superfamilies occupy different genomic niches. MITEs,
Helitrons, CACTAs, and MULEs tend to insert preferentially in genic regions (
Bureau and Wessler 1992;
Bureau et al. 1996;
Naito et al. 2006;
Wei et al. 2009;
Zerjal et al. 2009);
Mu elements insert preferentially into the 5′ ends of genes, a preference that correlates with epigenetic marks (
Liu et al. 2009); and high-copy-number retroelement families seem to preferentially target non-genic hypermethylated regions, where they nest into each other (
Wei et al. 2009;
Zerjal et al. 2009). Exceptions to these general rules do exist, however. For example, several papers report the presence of MITEs in non-genic regions, and retrotransposons commonly capture gene fragments (
Baucom et al. 2009;
Wei et al. 2009), suggesting that retrotransposons do occasionally insert into hypomethylated, gene-rich regions.
TEs also facilitate structural variation within species, either through polymorphic TE insertions and deletions or by mediating ectopic recombination events. Comparing a region spanning 2.8 Mb between two maize inbred lines,
Brunner et al. (2005) found that, on average, more than 50% of the genomic sequence was not collinear. A similar result was obtained by comparing 8 haplotypes in the
bz region from several maize accessions (
Wang and Dooner 2006). This structural variation may also be responsible, in part, for pronounced differences in DNA content among maize accessions. Maize genome sizes range from 4.92 to 6.87 pg/2C (
Poggio et al. 1998), a ~1.4-fold size variation. However, TEs are clearly not the only genomic component responsible for substantial genome size variation within species, as DNA content also correlates with the number and size of heterochromatic knobs (
Laurie and Bennett 1985). These knobs are regions of heterochromatin composed of 180-bp tandem repeats (
Peacock et al. 1981), 350-bp tandem repeats (
Ananiev et al. 1998), and various retrotransposons. They may account for as much as 8% of the genome (
Peacock et al. 1981;
Ananiev et al. 1998) but vary greatly in number, size, and genomic location across maize and its relatives (
Brown 1949;
Xiong et al. 2005;
Lamb et al. 2007;
Albert et al. 2010).
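As a rough illustration of the scale of this variation, the 2C values quoted above can be converted to haploid (1C) genome sizes and compared directly. The sketch below assumes the commonly used conversion of 1 pg of DNA to approximately 978 Mb, which is not stated in the text; the function and variable names are illustrative only and do not come from the cited studies.

```python
# Assumption: 1 pg of DNA corresponds to roughly 978 Mb (a widely used
# conversion factor; it is not given in the text above).
MB_PER_PG = 978.0


def haploid_size_mb(pg_2c: float) -> float:
    """Convert a 2C DNA content in picograms to a 1C genome size in Mb."""
    return pg_2c / 2.0 * MB_PER_PG


def fold_variation(smallest: float, largest: float) -> float:
    """Ratio of the largest to the smallest value."""
    return largest / smallest


# Range of maize genome sizes quoted in the text (pg/2C).
smallest_2c, largest_2c = 4.92, 6.87

print(f"1C size range: {haploid_size_mb(smallest_2c):.0f} to "
      f"{haploid_size_mb(largest_2c):.0f} Mb")
print(f"fold variation: {fold_variation(smallest_2c, largest_2c):.2f}")
```

Under this assumed conversion, the reported range corresponds to a difference of roughly 950 Mb between the smallest and largest haploid maize genomes.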
Maize is a member of the genus
Zea, which is traditionally divided into two sections:
Luxuriantes and
Zea. The former encompasses several species, including the annual diploid
Zea luxurians (hereafter
luxurians), whereas the latter consists solely of the annual diploid maize (
Z. mays ssp.
mays) and its closest wild relatives (ssp.
parviglumis and ssp.
mexicana; hereafter
parviglumis and
mexicana). Divergence between
parviglumis and maize is very recent, dating to domestication about 9,000 BP (
Matsuoka et al. 2002). In contrast,
Z. mays sensu lato and
luxurians diverged ~140,000 years ago (
Hanson et al. 1996;
Ross-Ibarra et al. 2009), and the genomes of
luxurians and maize differ in size (
Poggio et al. 1998). To investigate the nature of this difference in genome size,
Meyers et al. (2001) assessed the abundance of 6 retroelements in both species but found little evidence of variation in copy number between species. In contrast, knob repeats seemed to be more numerous in
luxurians than in maize (
Meyers et al. 2001).
High-throughput, “next-generation” sequencing offers a unique opportunity for whole-genome analysis via either de novo assemblies or mapping to a reference genome. These approaches have also proven useful for assessing structural variation in species such as Drosophila melanogaster (Cridland and Thornton 2010). However, the complexity of plant genomes and the extent of their repetitive fraction will likely render these tasks much more challenging than in simpler eukaryotic genomes. The genomic complexity and fluidity of Zea make it an excellent model system for addressing the evolutionary dynamics of TEs within and between species. Here, we use paired-end Illumina sequencing to evaluate genome content in maize and luxurians, with four main goals. First, we compare our inferences from Illumina data to the maize B73 reference genome to determine whether our short-read sequencing approach yields reasonable estimates of copy number. Second, we assess the sampling required with Illumina sequencing to gain robust estimates of TE content. Third, we investigate insertion biases of TE families near genes versus those nested into other TEs. Finally, we evaluate the difference in TE content between the maize B73 genome and an accession of luxurians.
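The idea of estimating copy number from short reads can be illustrated with a simple depth-based calculation: if reads sample the genome uniformly, the read depth over a TE family's exemplar sequence, relative to the average genome-wide depth, should approximate that family's copy number. The sketch below is a generic illustration under that uniform-sampling assumption, not the procedure used in this study; the function name, its parameters, and the example values are all hypothetical.

```python
def estimate_copy_number(mapped_reads: int,
                         read_length: int,
                         exemplar_length: int,
                         total_reads: int,
                         genome_size: int) -> float:
    """Depth-based copy-number estimate for one TE family (illustrative).

    Assumes reads are sampled uniformly across the genome, so that the
    depth over the exemplar divided by the genome-wide depth approximates
    the number of copies of that family.
    """
    if min(read_length, exemplar_length, total_reads, genome_size) <= 0:
        raise ValueError("lengths, read totals, and genome size must be positive")
    if mapped_reads < 0:
        raise ValueError("mapped_reads cannot be negative")

    # Average depth expected for a single-copy sequence.
    genome_depth = total_reads * read_length / genome_size
    # Observed depth over the TE exemplar sequence.
    exemplar_depth = mapped_reads * read_length / exemplar_length
    return exemplar_depth / genome_depth


# Hypothetical values, chosen only to make the arithmetic easy to follow.
copies = estimate_copy_number(mapped_reads=5_000,
                              read_length=100,
                              exemplar_length=10_000,
                              total_reads=1_000_000,
                              genome_size=2_000_000_000)
print(f"estimated copies: {copies:.0f}")
```

In practice, such an estimate is sensitive to reads that map to several related families and to sequence divergence among copies, which is one reason comparison against a reference genome is a useful check.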