Genic, genetic, and physical features of a contiguous 22Mb sequence of maize chromosome 4
Contig 182 in the B73 physical map of the maize genome 
is located on chromosome 4 (Chr4), and was selected for analysis due to its large contiguous size (~22 Mb) and exceptional colinearity with rice Chr2.
Physical and genetic features of AR182.
Many interesting genes have been identified in this region (Table S1
), such as rf2b
(a paralog of a nuclear restorer of cytoplasmic male sterility that encodes an aldehyde dehydrogenase), opaque endosperm 1
(mutations in which result in defective kernels), nitrite reductase 2 (nii2), gl4
(a gene involved in the accumulation of cuticular waxes), and QTL related to ear length, diameter, grain yield, kernel length, weight, oil/protein/starch content, pest resistance and disease resistance 
. Although several genes in this region have been cloned and functionally characterized, e.g., nii2 
, rf2b 
, and gl4 
, none of the QTL have been functionally characterized.
Starting with the sequence-ready physical map 
, we selected a MTP of 176 BAC clones (Table S2
) across contig 182 using the MTP analysis function of the Fingerprinted Contigs (FPC) program 
. Standard shotgun sequencing protocols were employed for each BAC, and assembled sequences (~4–6× redundancy) underwent K-mer analysis to identify repeats 
. The remaining low-copy-number regions were finished to high quality. Pseudomolecules were constructed using BAC end sequences, overlap and scaffold information, and were adjusted and validated by alignment with the maize B73 optical map (see Materials and Methods for details). The final sequence contained 21,702,972 bp in 907 uninterrupted sequence blocks, herein referred to as accelerated region 182 (AR182). The contig N50 is 57,261 bp, and the scaffold N50 is 160,621 bp.
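The N50 values quoted above follow the usual definition: the length L such that sequence blocks of length L or greater contain at least half of all assembled bases. A minimal Python sketch of that calculation is given below; the contig lengths are hypothetical and are not the AR182 contigs.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths in bp (illustrative only).
example_contigs = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000, 10_000]
print(n50(example_contigs))
```

The same function applies to scaffold lengths; only the input list changes.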
In this region, there are 178 genetic markers (Table S3
) from bin 4.06 to bin 4.08 in the IBM2 2008 Neighbors maize genetic map (http://www.maizegdb.org/map.php
), a consensus map compiled from all available maize mapping populations. Among the 150 markers with sequence information, 124 were identified in AR182, and 18 were located in flanking contig 181 (17 markers) or neighboring contig 197 (1 marker). The remaining eight markers were all placed in other regions of the maize genome. Seven of these eight are multiple-copy restriction fragment length polymorphism (RFLP) markers that could not be detected on maize Chr4 at an E-value threshold of e−5, perhaps because these RFLP markers were incorrectly mapped or are not present in the B73 genome. Two companion studies 
used resequencing and comparative genome hybridization to demonstrate that maize exhibits high frequencies of haplotype-specific sequences (presence/absence variation, PAV). Many of these PAVs may have arisen as a consequence of the movement of transposable elements carrying genes or gene fragments. This finding, in combination with our use of a consensus map derived from multiple mapping populations, may explain the absence of the eight genetic markers in AR182. Among the 13 framework markers with solid genetic positions, 12 had corresponding sequences in AR182. With the exception of two adjoining markers (umc104a
) with switched positions, all other markers had the same order in the physical map as in the genetic map. The ratio of genetic to physical distance across AR182 averaged 4.4 cM/Mb (Table S4
), somewhat lower than the previously estimated genome average of 5.5 cM/Mb.
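The comparison of genetic and physical maps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the marker coordinates are hypothetical (chosen so that the ratio comes out near 4.4 cM/Mb), and the helper names `recombination_ratio` and `order_conflicts` are introduced here for the example rather than taken from the study.

```python
def recombination_ratio(markers):
    """Average cM/Mb between the physically first and last marker.

    `markers` is a list of (genetic_cM, physical_bp) tuples.
    """
    ordered = sorted(markers, key=lambda m: m[1])
    (cm_start, bp_start), (cm_end, bp_end) = ordered[0], ordered[-1]
    physical_mb = (bp_end - bp_start) / 1_000_000
    if physical_mb <= 0:
        raise ValueError("markers must span a positive physical distance")
    return (cm_end - cm_start) / physical_mb


def order_conflicts(markers):
    """Return index pairs (in physical order) of adjacent markers whose
    genetic order is reversed relative to their physical order."""
    ordered = sorted(markers, key=lambda m: m[1])
    return [
        (i, i + 1)
        for i in range(len(ordered) - 1)
        if ordered[i][0] > ordered[i + 1][0]
    ]


# Hypothetical markers: (genetic position in cM, physical position in bp).
example_markers = [
    (100.0, 1_000_000),
    (122.0, 6_000_000),
    (118.0, 7_000_000),  # genetically before its physical neighbor
    (144.0, 11_000_000),
]
print(recombination_ratio(example_markers))
print(order_conflicts(example_markers))
```

In this toy data set, one adjacent pair has switched positions, analogous to the single pair of adjoining framework markers reported above.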
Transposable elements and their contributions to maize genome evolution
Transposable elements (TEs) are the most numerous and unstable components of the maize genome, and of all other complex plant genomes studied to date. In addition, TEs significantly complicate genome assembly and annotation because they are often repetitive, can be located in and around genes, and often encode ORFs that are easily mistaken for standard plant genes 
. Because many of these TEs, especially the long terminal repeat (LTR) retrotransposons, are large and very similar in sequence due to their recent amplification, repetitive TEs are a major source of gaps and misassembled contigs in complex plant genomes. The simplest way to minimize the negative impact of TEs on gene discovery and annotation is to initially describe all of the TEs in a region. This allows TEs to be computationally masked, thereby providing a residual sequence that can be carefully analyzed. Structure-based searches are particularly useful for the discovery of novel TEs, given that many are both low in copy number and represented in EST libraries.
TE and other repeats were sought within the assembled sequence of AR182 by several independent approaches. Repeats per se
were identified using an oligonucleotide counter that searched for the representation of all possible 20-mers in 1,124,441 whole genome shotgun reads (1,088,525,270 nucleotides; ~0.45 genome equivalents 
). Repeats also were found by homology to known repeats in the MIPS REdata database (v4.3) 
and TE exemplar databases 
. Finally, structure-based searches were employed to identify novel TEs, including those that exist in low copy numbers. These structure-based search processes rely on the unique characteristics of particular classes of TEs, especially their end structures, but require significant manual curation to confirm the validity of any candidate TEs that are identified.
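The oligonucleotide-counting approach can be illustrated with a short Python sketch: count every k-mer in a set of shotgun reads, then flag the bases of an assembled sequence that are covered by frequently observed k-mers. This is a simplified illustration, not the counter used in the study. The function names, the `min_count` threshold and the toy reads are assumptions, only the forward strand is counted, and k is set to 4 in the example so that the result can be checked by hand (the analysis above used 20-mers).

```python
from collections import Counter


def count_kmers(reads, k=20):
    """Count every k-mer (forward strand only) in a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def repeat_mask(sequence, counts, k=20, min_count=2):
    """Return one boolean per base: True if the base lies within any
    k-mer observed at least `min_count` times in the reads."""
    sequence = sequence.upper()
    masked = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        if counts[sequence[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return masked


# Toy reads and assembled sequence (hypothetical).
toy_reads = ["ACGTACGTAC", "TTACGTACGG"]
toy_counts = count_kmers(toy_reads, k=4)
toy_mask = repeat_mask("GGACGTACTT", toy_counts, k=4, min_count=3)
print("".join("x" if m else "." for m in toy_mask))
```

With a sample of roughly 0.45 genome equivalents, the threshold separating repeats from low-copy sequence would have to be chosen empirically; the value used here is arbitrary.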
The most abundant repeats identified were the LTR retrotransposons, which were found to constitute about 74.6% of the assembled sequence. The identified LTR retrotransposons were divided into 237 families. Intact elements (i.e. with 2 LTRs and the appropriate internal sequences) were found in this region for 47 of these families. One hundred and eighty-one of these families were represented in maize EST libraries (data not shown). The specific elements present, their copy numbers and their relative coverage on AR182 are provided in Table S5
. As seen in earlier studies of maize 
and other large plant genomes 
, most of these elements are inserted into each other in nested arrangements with the oldest elements at the base of the stacks (e.g. Figure S1A
and Figure S2A
). Two other classes of retroelements, LINEs and SINEs, were located in this region, providing 1.1% and 0.03% of the assembled sequence, respectively (Table S6).
In AR182, Copia
-like retrotransposons were found to be over-represented (29.2%) relative to the entire maize genome (23.7%), while Gypsy
-like retrotransposons were found to be under-represented (38.9% vs 46.4%). These results agree with earlier studies 
showing that different LTR retrotransposons preferentially accumulate in different areas of the maize genome. Although all of these high-copy-number LTR retrotransposons appear to prefer to insert into each other rather than into genes, they also distinguish between LTR retrotransposon clusters that are near genes and those that are in largely gene-free regions like pericentromeric heterochromatin. In yeast, this class of elements finds insertion sites by association between the element-encoded integrase and specific heterochromatin proteins 
. The presence of chromodomains in some, but not all, plant LTR retrotransposons 
suggests a similar targeting mechanism.
Summary of transposable elements.
DNA transposons also were well represented in this region, including 92 CACTA elements (66 families), 420 hAT elements (178 families), 744 MITEs (182 families), 163 MULEs (88 families) and 1,149 mostly fragmented Helitrons (6 families); each class comprised between 1% and 3% of AR182. Few of these elements are likely to be autonomous (encoding all the functions needed for transposition). For seven of the CACTA families, we found at least one copy with intact open reading frames. Four Helitrons were found to contain apparently full-length Rep/helicase genes with protein products believed to be necessary for transposition.
Unlike the highly abundant LTR retrotransposons, the MITEs, Helitrons
, CACTAs and MULEs primarily were found to be associated with genes (Figure S3
). This is also the case for small SINE retroelements, as most copies present in the AR182 region were found in gene introns. The preferential insertion and/or retention of these lower-copy-number elements in these presumably euchromatic regions has the advantage of maintaining their potential for expression. However, by locating in recombinationally active regions near genes 
their potential to contribute to genome rearrangements is increased.
Perhaps the most amazing characteristic of the maize genome is the incredible number of gene fragments that are found inside TEs. Several classes of TEs have been found to acquire and transpose fragments of normal cellular genes, with MULEs and Helitrons
particularly active in this regard 
. AR182 was found to contain 20 LTR retrotransposons with apparent gene fragment insertions, plus 9 MULEs, 5 CACTA TEs, and 187 Helitrons
with one or more acquired gene fragments (Table S7
). The capture of gene fragments by LTR retrotransposons and CACTA elements has been reported before 
, but the extent has not been known for any plant genome. The analysis of AR182 demonstrated that this is a common phenomenon in maize.
In purely automated genome annotations, most or all of these fragments would have been counted as genes. Hence, in this region, 1,009 rather than 544 genes would have initially been predicted, and extrapolations to the entire maize genome would have nearly doubled overall gene content. This error, combined with the common error of annotating TE-encoded transposition genes as standard plant genes, is principally responsible for the two-fold or greater errors in gene content that have sometimes occurred in plant genome analysis 
. Beyond the complications they create for gene discovery and annotation, the gene fragments within TEs also generate many questions about their possible contributions to host cell biology. Although the rapid rate of removal of unselected DNA from plant nuclear genomes 
suggests that the great majority of the gene fragments and multi-gene chimeras within TEs rapidly become extinct, even the rare creation of a novel gene by the process of exon shuffling 
could have enormous biological significance. Many cases of “transposon domestication” 
, where all or part of a TE has been co-opted by the host organism to perform an important biological function, have now been reported. The acquisition of gene fragments from multiple loci, and their fusion with each other and with standard TE proteins, should only increase the potential for valuable novelty and domestication. Equally important, the epigenetic silencing of TEs by siRNAs 
predicts that many of the gene fragments inside TEs could contribute to the pool of siRNAs, and thereby acquire regulatory roles over the genes from which they were derived. Perhaps this is the mechanism of origin of some microRNAs: fragments created by TEs may have evolved to encode specific small RNAs that regulate the source gene.
The distributions of these TEs across the region appeared uneven when viewed at the level of the entire AR182. Among LTR retrotransposons, the concentrations of Gypsy- and Copia-like elements were correlated inversely. On a smaller scale, specific TE arrangements were found to be highly non-random. LTR retrotransposons primarily were inserted into each other and away from most genes, while DNA transposons, such as CACTAs, Helitrons, and MULEs, or small retroelements such as SINEs, were near genes. It should be noted that novel TEs that are low in copy number or have no intact copies here or elsewhere in the maize genome will still have been missed in this annotation process, so it is expected that this will cause some under-estimate of TE number and an over-estimate of gene number.
TE and gene distribution along AR182.
Gene identification and characterization
Annotation of protein-coding genes was based predominantly on extrinsic evidence, using a gene building process adapted from Ensembl 
. Sources of evidence included sequences from maize full-length cDNAs 
as well as ESTs and proteins. Ab initio
predictions were included only where they did not overlap with evidence-based genes, or where overlap allowed extension of coding sequences. Although known repeats were masked prior to annotation, additional measures (see Materials and Methods
) were needed to screen TEs, a common source of false positive predictions in plants 
. Manual methods also were used to identify and remedy falsely split or fused gene models, though these were relatively rare. The resulting gene set includes 544 annotated loci, of which 514 were evidence-based, including 160 by full-length cDNAs (Table S8
). Overall, AR182 has a gene density of 25 genes per Mb. Gene content in the 2045 Mb RefGen_v1 whole genome assembly was estimated at between ~37,000 and ~39,000, giving a gene density of 18 to 19 genes/Mb 
. Hence, AR182 is relatively gene-rich compared to the genome overall. Seven pairs of genes were found to be overlapping, and this conclusion is supported by full-length cDNA or protein homologs in other species. In rice, the presence of overlapping genes is relatively common and most are caused by transcripts using the promoter or enhancer of LTRs in a retrotransposon (Wei and Wing, unpublished). Given the large number of LTR retrotransposons in maize, it would not be surprising if overlapping genes were also common in maize. Among the non-overlapping genes, the intergenic spaces in 246 (45.3%) of the 543 gene spaces were less than 10 kb, while 240 (44.2%) genes were separated by more than 20 kb. Fifty-four of the intergenic regions were greater than 100 kb, with the largest being 530 kb (Figure S4
). Most of these large intergenic regions are filled with nested LTR retrotransposons (Figure S1
and Figure S2).
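The gene density comparison and the intergenic-space percentages above are simple ratios, and can be rechecked with a few lines of Python. All input values are taken from the text; the variable and function names are introduced here for illustration only.

```python
AR182_GENES = 544
AR182_BP = 21_702_972
GENOME_MB = 2045                       # RefGen_v1 assembly size in Mb
GENOME_GENE_RANGE = (37_000, 39_000)   # estimated whole-genome gene content


def genes_per_mb(gene_count, size_mb):
    """Gene density in genes per megabase."""
    return gene_count / size_mb


ar182_density = genes_per_mb(AR182_GENES, AR182_BP / 1_000_000)
genome_density = tuple(genes_per_mb(g, GENOME_MB) for g in GENOME_GENE_RANGE)

# Share of the 543 intergenic spaces below 10 kb and above 20 kb.
short_spaces_pct = 100 * 246 / 543
long_spaces_pct = 100 * 240 / 543

print(f"AR182: {ar182_density:.1f} genes/Mb")
print(f"Genome: {genome_density[0]:.1f} to {genome_density[1]:.1f} genes/Mb")
print(f"Intergenic <10 kb: {short_spaces_pct:.1f}%, >20 kb: {long_spaces_pct:.1f}%")
```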
Gene, exon, and intron lengths, as well as number of exons per gene, were found to be within previously estimated ranges 
. To make comparisons with other cereals, we selected 341 ortholog sets having three-way colinearity within syntenic regions of maize, sorghum and rice. Exon lengths were relatively invariable across species, consistent with previous findings 
. This contrasts with introns, which averaged 229 bp, 361 bp, and 498 bp for rice, sorghum, and maize, respectively. Haberer et al. 
had previously reported this trend and also found examples of introns harboring TEs, suggesting that such insertions were responsible for inflated intron sizes in maize, which is consistent with earlier reports of TE and retrotransposon insertions within maize introns 
. To further examine this hypothesis, we directly compared orthologous introns among maize, sorghum, and rice. Introns were paired based on their conserved position between flanking mapped exons (see Materials and Methods
). When introns of less than 1 kb were considered, lengths between pairs were strongly correlated (Figure S5
). The correlation was greater between maize-sorghum than between maize-rice, consistent with their more recent divergence. However, maize had more large introns, leading to discrepancies in paired intron lengths. For example, 2.6% of maize introns were observed to be larger than 3 kb, whereas this number was only 0.47% in sorghum and 0.17% in rice (Figure S6
). Length discrepancies in which the maize intron exceeded the length of its cross-species partner by more than 1 kb occurred in 4.7% of mapped intron pairs, whereas the reverse was true in only 0.55% of cases (Figure S7
). Figure S8
shows a clear linear relationship between length discrepancies in positionally conserved introns and repeat content within such maize introns. All told, about 2.4% of maize introns harbor repetitive sequences of 1 kb or greater (an example of nested LTR retrotransposons in an intron is shown in Figure S9
) and 11% of intron-containing maize genes have at least one intron with this characteristic. That these genes are active is strongly indicated by evidence derived from GenBank mRNAs/full-length cDNAs.
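The paired-intron comparison can be sketched as follows: given positionally matched intron pairs, count how often the maize intron exceeds its cross-species partner by more than 1 kb, and how often the reverse is true. The intron lengths below are hypothetical and the function name is an assumption made for this example; the mapping of introns between flanking exons is not reproduced here.

```python
def discrepancy_fractions(pairs, threshold_bp=1000):
    """Fractions of intron pairs with a large length discrepancy.

    `pairs` is a list of (maize_length, partner_length) tuples in bp.
    Returns (maize_longer, partner_longer), where each value is the
    fraction of pairs in which one intron exceeds the other by more
    than `threshold_bp`.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    maize_longer = sum(1 for m, p in pairs if m - p > threshold_bp)
    partner_longer = sum(1 for m, p in pairs if p - m > threshold_bp)
    return maize_longer / len(pairs), partner_longer / len(pairs)


# Hypothetical (maize, sorghum) intron lengths in bp.
example_pairs = [
    (120, 110), (498, 361), (4200, 350), (90, 95),
    (300, 1500), (2500, 2400), (1800, 600), (85, 85),
]
maize_frac, partner_frac = discrepancy_fractions(example_pairs)
print(maize_frac, partner_frac)
```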
Comparison of maize, sorghum, and rice genes.
Besides these protein-coding genes, five miRNA genes in four families were computationally identified. The overall density of miRNAs in this region is 3-fold higher than the genome-wide average, and all five genes have evidence of expression based on small RNA libraries.
Synteny analysis across maize, rice, and sorghum in AR182
Previous studies have shown that extensive genetic colinearity and synteny exist among the maize, rice and sorghum genomes 
. All those studies were based on either genetic markers or short contiguous sequence analysis. In this study, four sequence-to-sequence comparisons were performed among the three species, including maize-rice, maize-sorghum, rice-sorghum, and maize-maize analysis using BLASTZ 
and the Synteny Mapping and Analysis Program (SyMAP, 
). AR182 on maize Chr4 was found to align with rice Chr2 (29,020,340–35,806,283), sorghum Chr4 (57,193,840–60,617,265 and 63,725,383–67,939,454), and maize Chr5 from part of ctg250 to ctg254 (Figure S10; clone list in Table S9). The pairwise pseudomolecule-to-pseudomolecule sequence comparison is complemented by a comparative map based on homologous genes within these regions. The comparative map uses rice as a common reference because rice has been consistently identified as containing a relatively stable genome that closely resembles the ancestral state 
. In the syntenic regions, there were annotations of 544 maize genes, 825 rice genes, and 847 sorghum genes. A higher level of synteny was observed between rice and sorghum than between maize and rice. Indeed, 686 (83.2%) of the 825 rice genes in the corresponding region were found to be syntenic to sorghum, while 375 (45.5%) of the rice genes were syntenic to the maize region. Likewise, 685 (80.9%) of the 847 sorghum genes were syntenic to rice, while 362 (66.5%) of the 544 maize genes were syntenic to rice. Direct comparisons between maize and sorghum in AR182 revealed that 394 (72.4%) maize genes were syntenic to sorghum, while 396 (46.8%) sorghum genes were syntenic to maize genes (Figure S11
). Of course, any false positive gene annotations of TEs as genes in any of these regions 
would be perceived as having non-syntenic relationships. It should be noted that the selected AR182 region is highly colinear with rice; at the whole-genome level, however, maize is probably less syntenic with rice than estimated here. All five of the miRNA genes were found to be syntenic to a corresponding region in rice and sorghum. Four of the genes also were retained on the homeologous arm.
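Because each synteny percentage is expressed relative to the gene count of the query species, the same pair of species yields different values in each direction. The Python sketch below recomputes the reported percentages from the counts given in the text; the dictionary layout and function name are illustrative choices only.

```python
# Annotated gene counts in the syntenic regions (from the text).
ANNOTATED = {"maize": 544, "rice": 825, "sorghum": 847}

# (query, target) -> number of query genes syntenic to the target region.
SYNTENIC = {
    ("rice", "sorghum"): 686,
    ("rice", "maize"): 375,
    ("sorghum", "rice"): 685,
    ("maize", "rice"): 362,
    ("maize", "sorghum"): 394,
    ("sorghum", "maize"): 396,
}


def percent_syntenic(query, target):
    """Percentage of the query species' genes that are syntenic to target."""
    return 100.0 * SYNTENIC[(query, target)] / ANNOTATED[query]


for query, target in SYNTENIC:
    print(f"{query} -> {target}: {percent_syntenic(query, target):.1f}%")
```

The asymmetry between maize and sorghum (72.4% vs 46.8%) reflects the different denominators: a similar number of syntenic genes is divided by 544 maize genes in one direction and by 847 sorghum genes in the other.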
Comparative mapping of protein-coding and miRNA genes in orthologous segments of the rice, sorghum, and maize genomes.
Two hundred and forty-one maize genes (44.3%) from AR182 were syntenic to the homeologous region on Chr5. This result is quite different from a previous study that showed only 20–28% of the genes located on duplicated and sequenced regions of Chr1S and 9L 
were syntenic. At the genome level, 25% of the conserved maize genes maintained their homeologous copy 
. These results suggest that the degree of genome “fractionation” (i.e., loss of one homeologous copy from the ancestral Zea
tetraploid formed 5–12 MYA 
) can be very different in various regions of the genome. As expected, 337 (75%) of the 450 rice genes that are not syntenic to AR182 were observed to be syntenic and colinear in the corresponding maize Chr5 region. In total, 726 (88%) of the 825 rice genes are syntenic to at least one of the two maize syntenic regions. These data strongly support previous proposals that deletion of redundant homologous maize genes is the major factor that disrupts colinearity between maize and other species.
Genome rearrangement and tandem duplication among maize, rice, and sorghum
Comparisons between maize-maize, maize-rice, maize-sorghum, and rice-sorghum revealed several rearrangements. Regions syntenic to AR182 from both maize and sorghum contain large inversion breakpoints that formed independently after the maize-sorghum lineage split. By contrast, colinearity between the maize homeologous region in Chr5 and the rice genome spans the entire region, with no apparent rearrangement (Figure S10
), indicating that the inversion on maize Chr4 occurred after the ancestral Zea
tetraploidization. Inversions in both maize and sorghum extend beyond the region under study. For sorghum, the inversion breakpoints occur at ~57.1 and ~63.7 Mb. Because the first breakpoint lies outside AR182, the inversion introduces an ~3.1 Mb flanking sequence, bearing some 375 genes, for which homologous genes are absent from the other genomes within the scope of the region. For maize Chr4, the first inversion breakpoint is at ~8.5 Mb, while the second occurs downstream within ctg184 (not shown). This left a gap in rice within which ~68 genes map to ctg184 rather than AR182. Additional, possibly overlapping, inversions occur within maize Chr4, at ~2.9 to 4.4 Mb, and these also arose after the whole-genome allotetraploidization. Finally, a smaller inversion is conserved in both sorghum and the two homologous regions of maize, corresponding to coordinates ~34.6 to 34.7 Mb in rice. This rearrangement occurred after the rice-sorghum/maize lineage split but its lineage of origin is unclear.
By using rice as a reference genome, one can infer the timing of each rearrangement. All of the rearrangements were observed to be specific to each genome and none were shared among the genomes. Previous studies showed that rice diverged from maize and sorghum about 50–70 million years ago, the ancestors of maize and sorghum diverged about 12 MYA, and the two ancestors of current maize hybridized about 4.8 MYA 
. Combining the evolutionary data of the species with comparisons in AR182-rice-sorghum and maize AR182-rice-maize Chr5, one can infer that these inversions occurred after lineage divergence. The maize Chr5 region demonstrates perfect synteny to rice and therefore preserves the original order and orientation of the ancestors of maize and sorghum. The sorghum genome experienced the inversion after divergence from the ancestors of maize, while the two larger inversions in AR182 of maize Chr4 perhaps arose during genome shuffling after the tetraploid progenitor of maize originated. In sequence divergence (Ks) analysis (see below), indistinguishable distances were observed between sorghum/maize and maize/maize homeologs, indicating very similar dates for lineage divergence and the ancestral maize duplication, consistent with the ~12 MYA timing predicted in a previous study.
Extensive tandem gene duplication has been found in Arabidopsis (17%) and rice (14–29%). In AR182, 51 (8.1% of the total) genes were found to be involved in 14 tandem duplication clusters with 2–19 genes in each cluster. Most (9) of the clusters have only two genes. The largest gene family in the region is the 19-member DUF1754 superfamily. This gene family is present in most eukaryotic genomes, including those in mammals, birds, fish, insects, fungi and plants. The biological function of the DUF1754 superfamily is unknown. There is one gene copy in most species (such as human, chimpanzee, chicken, rice and Arabidopsis), two copies in several others (mouse, sorghum, and poplar), and seven copies in the bovine genome. The gene was not detected in nematodes. The 19 members in AR182 are distributed in a 1.16 Mb region and are interrupted by twelve other genes. Additionally, there are two other family members in maize, located on Chr3 and Chr8.
Interestingly, 8 of the 14 gene clusters are not syntenic with either rice or sorghum. In the corresponding co-linear rice region, there are 105 genes (10.6%) involved in 33 duplication clusters with each cluster varying from 2 to 8 genes. Nineteen of the 33 clusters involved only 2 genes and 20 of the 33 clusters have no syntenic relationships to maize AR182. Ninety-two (10.0%) of the sorghum genes were observed to be involved in 37 tandem duplication clusters, with 2 to 7 genes in each cluster. Twenty-six of the 37 sorghum gene clusters have 2 genes and 12 of the 37 clusters have no syntenic relationships with maize.
The synteny data for tandem gene duplication in rice, sorghum, and maize indicate that most of the tandem duplication occurred after lineage divergence, in agreement with previous studies in Drosophila
showing that tandemly duplicated genes tend to be younger, with lower survivorship.
High frequencies of mutation and truncation among non-syntenic genes
We are aware of at least two possible processes that would result in non-synteny: gene mobilization from one location to a new location and corresponding gene loss in the other species. Because most genes that are non-syntenic relative to rice are also non-syntenic relative to sorghum, the more parsimonious explanation is that these non-syntenic maize genes were mobilized from elsewhere in the genome. As shown above, mobilization of genes, particularly by transposons such as Helitrons
and Pack-MULEs, frequently results in fragmentation of the amplified/transposed copy 
. To examine this phenomenon, we calculated the ratio of the CDS (coding sequence) length of the maize gene to that of its best scoring rice or sorghum homolog (ortholog for syntenic relationships). While syntenic genes have a single CDS ratio peak centered at one, non-syntenic loci have a bimodal distribution, with a second peak centered at 0.4, indicative of frequent truncation. Relative to sorghum, 68% of non-syntenic maize genes have a CDS ratio of less than 0.8, whereas only 14% of syntenic loci do. Thus, a substantial proportion of non-syntenic genes are fragmented, consistent with a mechanism of gene mobilization and the likelihood that these are truncated pseudogenes.
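The CDS length ratio and the 0.8 truncation threshold can be illustrated with a brief Python sketch. The loci and CDS lengths are hypothetical, and the function names are introduced for this example only; identification of the best-scoring homolog is assumed to have been done beforehand.

```python
def cds_ratio(maize_cds_len, homolog_cds_len):
    """Ratio of maize CDS length to that of its best-scoring homolog."""
    if homolog_cds_len <= 0:
        raise ValueError("homolog CDS length must be positive")
    return maize_cds_len / homolog_cds_len


def fraction_truncated(loci, syntenic, threshold=0.8):
    """Fraction of loci in one synteny class whose CDS ratio is below
    `threshold`. `loci` holds (maize_len, homolog_len, is_syntenic)."""
    selected = [(m, h) for m, h, s in loci if s == syntenic]
    if not selected:
        raise ValueError("no loci in the requested synteny class")
    truncated = sum(1 for m, h in selected if cds_ratio(m, h) < threshold)
    return truncated / len(selected)


# Hypothetical loci: (maize CDS bp, homolog CDS bp, syntenic?).
example_loci = [
    (1200, 1210, True), (900, 880, True), (700, 1000, True),
    (400, 1000, False), (300, 900, False), (950, 1000, False),
]
print(fraction_truncated(example_loci, syntenic=True))
print(fraction_truncated(example_loci, syntenic=False))
```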
Distribution of truncated genes among syntenic and non-syntenic maize loci.
To further characterize these loci, synonymous (Ks) and non-synonymous (Ka) mutation rates were measured relative to their best-scoring homologs in sorghum. For this analysis, six potential false-negative syntenic genes were identified by TBLASTN alignment to sorghum, possibly missed due to omission of these genes in the sorghum annotation. Distributions of Ks and Ka for maize loci were stratified by synteny relationship and by evidence of truncation, using a CDS length ratio threshold of 0.8. Large differences were seen between syntenic genes and non-syntenic genes for both Ka and Ks. The Mann-Whitney test 
for non-parametric data showed that these differences are significant. The median Ks is 0.2352 (95% CI 0.2131 to 0.2674) for syntenic loci compared to 0.9769 (95% CI 0.7153 to 1.5543) for non-syntenic loci (P<0.0001). Ks was significantly different even when considering only genes with a CDS length ratio ≥0.8. For this class, the median Ks for syntenic loci was 0.2326 (95% CI 0.2130 to 0.2681), compared to 2.0389 (95% CI 0.3114 to 3.7455) for non-syntenic genes (P<0.0001). Thus, truncation itself is not associated with elevated Ks values. Indeed, the Ks for non-syntenic loci having a CDS length ratio <0.8 (median = 0.9064; 95% CI 0.5168 to 1.2777) is not significantly different from that of loci having a CDS length ratio ≥0.8 (P = 0.6310). Because Ks approximates mutation rate 
, this result suggests that non-syntenic mappings have a more ancient relationship than do the orthologous relationships found in syntenic genes. The rate of non-synonymous mutation (Ka) likewise is elevated among non-syntenic genes. The median Ka for syntenic loci is 0.0442 (95%CI 0.0411 to 0.04889) compared to 0.2965 (95%CI 0.2426 to 0.3981) for non-syntenic loci (P<0.0001). It is clear that non-syntenic loci have vastly different properties compared to syntenic genes and that the identified sorghum homologs of non-syntenic maize genes cannot be regarded as orthologs.
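For readers unfamiliar with the Mann-Whitney test, the following Python sketch shows how two sets of Ks values can be compared by rank. It is a simplified illustration that assumes a two-sided test with a normal approximation (tie-corrected, without continuity correction); the statistical software and settings used in the study may differ, and the Ks values below are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using a normal approximation.

    Returns (U for sample x, approximate p-value). Tied values receive
    average ranks and the variance is corrected for ties.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical Ks values for syntenic and non-syntenic loci.
ks_syntenic = [0.20, 0.23, 0.25, 0.21, 0.27]
ks_non_syntenic = [0.72, 0.98, 1.55, 0.31, 2.04]
u_stat, p_value = mann_whitney_u(ks_syntenic, ks_non_syntenic)
print(u_stat, p_value)
```

With samples as small as those in this toy example, an exact test would normally be preferred over the normal approximation.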
Box-plots showing divergence rates among syntenic (SYN) and non-syntenic (nSYN) maize genes relative to their best scoring homolog in sorghum.
Small RNA analysis
To determine the extent to which the sequence of AR182 may contribute to, or interact with, the small RNA population expressed by the whole maize genome, five small RNA libraries representing different maize tissues and genetic backgrounds were analyzed (see Materials and Methods
for details). Three libraries (B73-zma1, B73-zma2 and B73-zma3) were constructed using small RNA fractions from young leaves, immature ears and immature tassels, respectively, of a B73 genotype. The remaining two libraries (K55-wt and K55-mop1) were previously described by Nobuta et al. 
and include small RNAs from immature ears of wild-type and mop1-1
maize, respectively, in a K55 background. The mop1
gene was shown to encode an ortholog of Arabidopsis
RNA-DEPENDENT RNA POLYMERASE 2 (RDR2) and is required for the establishment of paramutation and the maintenance of transcriptional silencing of transposons and transgenes.
On average, the proportion of distinct small RNAs matching the sequence of AR182 at least once was 12% per library, corresponding to a range of ~147,000 to ~380,000 different small RNA sequences (Table S10
). The leaf tissue library exhibited the lowest complexity, with approximately half the rate of matched, distinct sequences compared to any other sample (all of which represented reproductive organs) and a distinct-to-total reads ratio of 16% (compared to more than 30% in the other libraries).
All of the libraries exhibited a similar pattern of size distribution with two prominent peaks at 22 nt and 24 nt respectively (Figure S12
); as expected, in this contig, small RNAs in K55-mop1 presented a strikingly lower proportion of repeat-associated 24-mers compared to K55-wt. Moreover, consistent with prior reports, the 22-nt class predominantly associated with high-copy repeats was more abundant than the 21-mers, both when distinct numbers and total abundances were taken into account 
. Based on these observations, the population of small RNAs matched to AR182 demonstrated small RNA match rates and patterns consistent with other analyses of sub-genomic portions of the maize genome.
Among the small RNAs matching to AR182, 54% had more than two hits in a set of 60 Mb of maize contigs (including this contig from Chr4, plus two other contigs from Chr1 and Chr9), suggesting that most small RNAs may be derived from repetitive elements. First, 25–38% of the unique signatures from each library were found to match tandem repeats, which are known substrates for small RNA biosynthesis 
. Next, to investigate in detail the fraction of small RNA originating from transposons, several groups of repetitive elements that were mapped and annotated on the chromosome were examined: five principal families of DNA transposons (including Harbinger), two superfamilies of LTR retrotransposons (Copia and Gypsy), and a family of non-LTR retrotransposons (LINE1). The data from K55-mop1 were made comparable with the other libraries by dividing the abundance of all small RNAs by 5.3, the average overall enrichment observed for miRNAs in the mop1-1
. All the classes of repetitive elements analyzed expressed larger small RNA populations in the reproductive organs compared to leaves and showed a reduction in the mop1-1
mutant, relative to wild type (Figure S13A and S13B). Unique small RNAs related to the En-Spm
families were significantly the most frequent among the DNA transposons, irrespective of the tissue and the genetic background. This is consistent with the finding that the mop1
mutation can reverse the methylation status and silencing of Mutator
elements in maize 
, probably via a reduction of the corresponding siRNA population 
. Interestingly, the expected decrease of distinct signatures in K55-mop1
compared to K55-wt was more remarkable for MuDR
than for En-Spm
, in particular when the total abundances were considered (77% reduction vs 50% for the two families, respectively). However, the size distributions of the two populations were very similar, both involving a majority of 24-mers in wild type that are expected to be reduced in a mop1-1
background. Differences between the small RNAs of the two varieties (K55 vs B73) were also observed. Ears from K55 showed slightly higher small RNA abundances for MuDR than the equivalent tissue of B73. In addition, a much more pronounced difference was observed in the opposite direction for the hAT family, which was more abundant in the small RNAs of B73 ears (2:1). Further investigations are required to clarify to what extent this phenomenon is determined by different genetic backgrounds, environmental effects, or an imperfect correspondence between the developmental stages of the two samples.
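The library scaling and the percent-reduction comparison described above can be sketched as follows. This is a minimal illustration only: the scaling factor 5.3 comes from the text, while the function names and the abundance values are hypothetical and chosen purely to show the arithmetic.

```python
MOP1_ENRICHMENT = 5.3  # scaling factor stated in the text


def normalize_mop1(abundance, factor=MOP1_ENRICHMENT):
    """Scale a K55-mop1 abundance so it is comparable with the other libraries."""
    return abundance / factor


def percent_reduction(wild_type, mutant_scaled):
    """Percent decrease of the scaled mutant value relative to wild type."""
    if wild_type <= 0:
        raise ValueError("wild-type abundance must be positive")
    return 100.0 * (wild_type - mutant_scaled) / wild_type


# Hypothetical abundances (TPM) for one transposon family; not data from the study.
wt_abundance = 1000.0
mop1_raw = 2120.0

mop1_scaled = normalize_mop1(mop1_raw)
reduction = percent_reduction(wt_abundance, mop1_scaled)
print(f"scaled mop1 abundance: {mop1_scaled:.1f} TPM, reduction: {reduction:.1f}%")
```

Scaling before comparison matters here because an unscaled mutant library would appear enriched simply as a consequence of the overall miRNA enrichment, which would mask the family-specific reduction.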
Total number of distinct small RNAs originating from different classes of sequences.
Distributions of DNA transposons and their related small RNAs.
LTR retrotransposons were the most prominent repeat class matched by small RNAs, consistent with their large proportion in the genome. Non-redundant small RNAs mapped within these elements were, when averaged across the libraries, 38-fold more numerous than those matching to DNA transposons. Accordingly, the sum of their abundances was 15-fold greater than the total abundance of DNA transposon-specific small RNAs, after a normalization based on the number of copies in the available contigs. Since the disparity can be only partially accounted for by the difference in unit length between the two classes of repetitive elements, this observation suggested a more pronounced tendency of LTR retrotransposon sequences to be processed into small RNAs, possibly because their replication cycle involves an RNA intermediate. One unexpected observation was that, in every sample, the LTR retrotransposons of the Copia
superfamily in AR182 were represented by far fewer and less abundant small RNAs than the elements of the Gypsy superfamily (Figure S13). Because the difference in the total nucleotide length covered by the two superfamilies was negligible, this result suggests that Copia elements are less prone to provide templates for small RNA biogenesis. Nevertheless, considering the prominent role of siRNAs in the transcriptional silencing of transposable elements, the observed pattern of small RNA generation is not sufficient to explain the very low transcript level reported for most families of Copia LTR retrotransposons 
. A total of ~28,000 distinct signatures per library were found to match the gene space of AR182. This corresponded to 20% of the set of sRNAs originating from transposable elements; however, over the combined length of the 544 genes analyzed, the density of distinct sRNAs was 10-fold higher than in repeats, and the mean total abundance per library (~84,000 TPM) was at least 25% of that from transposons. We noticed that many of the genic small RNAs matched more than two genomic locations on average, possibly indicating either (1) sequence conservation of paralogs, or (2) mis-annotation of repetitive elements. A separate analysis of exons and introns revealed a strong bias for small RNAs accumulating in the latter, with introns having five times as many distinct small RNAs as exons and a four-fold larger total abundance after correcting for hits to the contigs (Table S11). Further analysis demonstrated that 64% of the intronic small RNAs matched identifiable repetitive elements.
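The exon versus intron comparison relies on two per-feature quantities: the number of distinct signatures and a total abundance corrected for hits. The sketch below assumes that the correction divides each signature's abundance by its number of matching locations; the text does not spell out the formula, so this is an assumption, and the function name, record layout, and values are all hypothetical.

```python
from collections import defaultdict


def summarize_by_feature(signatures):
    """Return {feature: (distinct_count, hits_normalized_abundance)}.

    Each record is (sequence, abundance_tpm, genomic_hits, feature).
    Assumption: hits correction = abundance divided by number of hits.
    """
    distinct = defaultdict(set)
    abundance = defaultdict(float)
    for seq, tpm, hits, feature in signatures:
        if hits < 1:
            raise ValueError("a matched signature needs at least one hit")
        distinct[feature].add(seq)
        abundance[feature] += tpm / hits
    return {f: (len(distinct[f]), abundance[f]) for f in distinct}


# Hypothetical records, for illustration only.
signatures = [
    ("ACGTACGTACGTACGTACGTACGT", 12.0, 1, "exon"),
    ("TTGACCGTAGGCTAACGTTAGCAA", 30.0, 3, "intron"),
    ("GGCATTCGAACGTTAGGCATCGAT", 8.0, 2, "intron"),
]
summary = summarize_by_feature(signatures)
for feature, (count, total) in sorted(summary.items()):
    print(feature, count, round(total, 2))
```

Dividing by the hit count keeps highly repeated signatures from dominating the totals, which is relevant given that most intronic small RNAs matched repetitive elements.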
The impact of repetitive elements on the small RNAs in proximity to gene promoters and trailers (upstream and downstream of annotated genes) was also analyzed. The upstream sequences of the 544 genes were investigated in 50 bp windows starting from the putative transcription start site. While the occurrence of TEs gradually increased from 2.7% to 12% between 1 and 200 bp upstream of the genes and again from 12% to 17% between 250 and 400 bp, the number of distinct signatures matching the first region (1 to 200 bp) was limited and rapidly increased in the second genomic interval (250 to 400 bp). The same pattern was apparent, and even more evident, when the total abundances of the matching sRNAs were analyzed (Figure S14). Moreover, after correcting the abundances for the hits to the contigs, a comparison to a region further upstream revealed that the 1–200 bp interval was predominantly matched by low-copy-number signatures. The analysis of downstream sequences showed a very similar profile, indicating a general paucity of small RNAs relative to the occurrence of TEs in the flanking regions next to the gene boundaries. However, this reduced set of small RNAs is only observable over a short distance in both 3′ gene trailers and 5′ gene promoters.
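The 50 bp window analysis of flanking sequence can be illustrated for a single gene as follows. This is a sketch under stated assumptions: coordinates are 1-based distances upstream of the putative transcription start site, the function names are invented for this example, and the intervals and match positions are hypothetical. In the study such per-gene values would be aggregated over all 544 genes.

```python
def window_index(distance, window=50):
    """Map a 1-based distance from the TSS to a 0-based window index."""
    if distance < 1:
        raise ValueError("distance must be 1 or greater")
    return (distance - 1) // window


def te_fraction_per_window(te_intervals, span=400, window=50):
    """Fraction of each window covered by TE intervals (inclusive ends)."""
    if span % window:
        raise ValueError("span must be a multiple of window")
    covered = set()  # a set avoids double counting overlapping TEs
    for start, end in te_intervals:
        covered.update(range(max(start, 1), min(end, span) + 1))
    counts = [0] * (span // window)
    for pos in covered:
        counts[window_index(pos, window)] += 1
    return [c / window for c in counts]


def signatures_per_window(match_positions, span=400, window=50):
    """Count small RNA match positions falling in each window."""
    if span % window:
        raise ValueError("span must be a multiple of window")
    counts = [0] * (span // window)
    for pos in match_positions:
        if 1 <= pos <= span:
            counts[window_index(pos, window)] += 1
    return counts


# Hypothetical annotations for one gene's upstream region.
te_fractions = te_fraction_per_window([(181, 230), (220, 300)])
sig_counts = signatures_per_window([12, 260, 275, 390, 405])
print(te_fractions)
print(sig_counts)
```

Comparing the two profiles window by window is what reveals a region where TE coverage rises while matching small RNAs remain scarce.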
Annotation comparison of AR182 to its corresponding region (AGP182) in B73 RefGen_v1
Consistent with the additional data used to construct the pseudomolecule (i.e., overlapping sequences, and ordering and manually orienting sequence contigs based on optical map evidence) and the degree of manual annotation/curation it received, AR182 exhibits significant improvement (see Materials and Methods) compared to AGP182, the corresponding sequence in B73 RefGen_v1. The total number of sequence contigs was reduced from 1170 to 907, whereas the average contig size increased from 18,923 bp to 23,860 bp. Similarly, the number of scaffolds was reduced from 544 in the highly automated AGP182 assembly to 440 in AR182, and the average scaffold size increased from 40,819 bp in AGP182 to 49,238 bp in AR182.
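The assembly statistics quoted above can be expressed as relative changes with a short calculation. The input numbers are taken from the text; the helper name and dictionary layout are illustrative choices rather than part of the original analysis.

```python
def relative_change(before, after):
    """Percent change from the AGP182 value to the AR182 value."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return 100.0 * (after - before) / before


# (AGP182, AR182) values as reported in the text.
stats = {
    "contig count": (1170, 907),
    "mean contig size (bp)": (18923, 23860),
    "scaffold count": (544, 440),
    "mean scaffold size (bp)": (40819, 49238),
}

changes = {
    name: round(relative_change(before, after), 1)
    for name, (before, after) in stats.items()
}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

By this calculation, AR182 has about 22% fewer contigs and 19% fewer scaffolds than AGP182, with mean contig and scaffold sizes larger by about 26% and 21%, respectively.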
Sequence and gene content comparison between AR182 and AGP182.
The use of the enhanced AR182 sequence led to slight but detectable differences in the annotation of repetitive elements compared to AGP182. While the identified coverage of all TE types was similar between AR182 and AGP182, the elements appeared less fragmented on AR182, a finding that is likely due to the improved assembly of AR182. Because the same databases were used to RepeatMask AR182 and AGP182, any difference between them can be attributed to differences in the level of sequence assembly and improvement. For example, comparisons of nested TE insertions (Figure S1A versus Figure S1B and Figure S2A versus Figure S2B) show that more complete LTR elements could be detected on AR182. This more complete description of TEs will improve detection sensitivity and characterization of TEs in future projects, and by extension improve the specificity of gene annotations as well.
In the MGSP, the low-copy regions of the genome were finished at high quality. Indeed, with the exception of two 1-bp mismatches and a 1-bp gap in three genes (highlighted in Table S8), the sequences of all predicted genes from AGP182 were identical to sequences present in AR182. These variations, however, had no effect on the three open reading frames and protein translations, except for a single amino acid substitution in gene ZmAcc7g20000928.
Although the current draft sequence is of tremendous utility to the maize/plant genetics research community as it stands today, like any genome sequence and annotation, it could be improved by the application of additional time, resources, new methods, technologies, analysis tools, etc. This manuscript attempts to quantify the benefits of doing so to a reasonable approximation for a small region of the maize genome. Our results demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.