Conserved non-coding sequences (CNS) are islands of non-coding sequence that, like protein coding exons, show less divergence in sequence between related species than functionless DNA. Several CNSs have been demonstrated experimentally to function as cis-regulatory regions. However, the specific functions of most CNSs remain unknown. Previous searches for CNS in plants have either anchored on exons and only identified nearby sequences or required years of painstaking manual annotation. Here we present an open source tool that can accurately identify CNSs between any two related species with sequenced genomes, including both those immediately adjacent to exons and distal sequences separated by >12 kb of non-coding sequence. We have used this tool to characterize new motifs, associate CNSs with additional functions, and identify previously undetected genes encoding RNA and protein in the genomes of five grass species. We provide a list of 15,363 orthologous CNSs conserved across all grasses tested. We were also able to identify regulatory sequences present in the common ancestor of grasses that have been lost in one or more extant grass lineages. Lists of orthologous gene pairs and associated CNSs are provided for reference inbred lines of arabidopsis, Japonica rice, foxtail millet, sorghum, brachypodium, and maize.
conserved non-coding sequences; comparative genomics; sorghum; rice; maize; gene regulation; genome evolution
Complex traits and other polygenic processes require coordinated gene expression. Co-expression networks model mRNA co-expression: the product of gene regulatory networks. To identify regulatory mechanisms underlying coordinated gene expression in a tissue-enriched context, ten Arabidopsis thaliana co-expression networks were constructed after manually sorting 4,566 RNA profiling datasets into aerial, flower, leaf, root, rosette, seedling, seed, shoot, whole plant, and global (all samples combined) groups. Collectively, the ten networks contained 30% of the measurable genes of Arabidopsis and were circumscribed into 5,491 modules. Modules were scrutinized for cis regulatory mechanisms putatively encoded in conserved non-coding sequences (CNSs) previously identified as remnants of a whole genome duplication event. We determined the non-random association of 1,361 unique CNSs to 1,904 co-expression network gene modules. Furthermore, the CNS elements were placed in the context of known gene regulatory networks (GRNs) by connecting 250 CNS motifs with known GRN cis elements. Our results provide support for a regulatory role of some CNS elements and suggest the functional consequences of CNS activation of co-expression in specific gene sets dispersed throughout the genome.
The grasses, Poaceae, are one of the largest and most successful angiosperm families. Like many radiations of flowering plants, the divergence of the major grass lineages was preceded by a whole-genome duplication (WGD), although these events are not rare for flowering plants. By combining identification of syntenic gene blocks with measures of gene pair divergence and different frequencies of ancient gene loss, we have separated the two subgenomes present in modern grasses. Reciprocal loss of duplicated genes or genomic regions has been hypothesized to reproductively isolate populations and thus to drive speciation. However, in contrast to previous studies in yeast and teleost fishes, we found very little evidence of reciprocal loss of homeologous genes between the grasses, suggesting that post-WGD gene loss may not be the cause of the grass radiation. The sets of homeologous and orthologous genes and predicted locations of deleted genes identified in this study, as well as links to the CoGe comparative genomics web platform for analyzing pan-grass syntenic regions, are provided along with this paper as a resource for the grass genetics community.
polyploidy; gene loss; synteny; Poaceae; speciation
The well supported gene dosage hypothesis predicts that genes encoding proteins engaged in dose–sensitive interactions cannot be reduced back to single copies once all interacting partners are simultaneously duplicated in a whole genome duplication. The genomes of extant flowering plants are the result of many sequential rounds of whole genome duplication, yet the fraction of genomes devoted to encoding complex molecular machines does not increase as fast as expected through multiple rounds of whole genome duplications. Using parallel interspecies genomic comparisons in the grasses and crucifers, we demonstrate that genes retained as duplicates following a whole genome duplication have only a 50% chance of being retained as duplicates in a second whole genome duplication. Genes which fractionated to a single copy following a second whole genome duplication tend to be the member of a gene pair with less complex promoters, lower levels of expression, and to be under lower levels of purifying selection. We suggest the copy with lower levels of expression and less purifying selection contributes less to effective gene-product dosage and therefore is under less dosage constraint in future whole genome duplications, providing an explanation for why flowering plant genomes are not overrun with subunits of large dose–sensitive protein complexes.
polyploidy; gene dosage; gene loss; genome evolution; comparative genomics; crucifers; grasses
Epigenetic variation describes heritable differences that are not attributable to changes in DNA sequence. There is the potential for pure epigenetic variation that occurs in the absence of any genetic change or for more complex situations that involve both genetic and epigenetic differences. Methylation of cytosine residues provides one mechanism for the inheritance of epigenetic information. A genome-wide profiling of DNA methylation in two different genotypes of Zea mays (ssp. mays), an organism with a complex genome of interspersed genes and repetitive elements, allowed the identification and characterization of examples of natural epigenetic variation. The distribution of DNA methylation was profiled using immunoprecipitation of methylated DNA followed by hybridization to a high-density tiling microarray. The comparison of the DNA methylation levels in the two genotypes, B73 and Mo17, allowed for the identification of approximately 700 differentially methylated regions (DMRs). Several of these DMRs occur in genomic regions that are apparently identical by descent in B73 and Mo17 suggesting that they may be examples of pure epigenetic variation. The methylation levels of the DMRs were further studied in a panel of near-isogenic lines to evaluate the stable inheritance of the methylation levels and to assess the contribution of cis- and trans- acting information to natural epigenetic variation. The majority of DMRs that occur in genomic regions without genetic variation are controlled by cis-acting differences and exhibit relatively stable inheritance. This study provides evidence for naturally occurring epigenetic variation in maize, including examples of pure epigenetic variation that is not conditioned by genetic differences. The epigenetic differences are variable within maize populations and exhibit relatively stable trans-generational inheritance. 
The detected examples of epigenetic variation, including some without tightly linked genetic variation, may contribute to complex trait variation.
Heritable variation within a species provides the basis for natural and artificial selection. A substantial portion of heritable variation is based on alterations in DNA sequence among individuals and is termed genetic variation. There is also evidence for epigenetic variation, which refers to heritable differences that are not caused by DNA sequence changes. Methylation of cytosine residues provides one molecular mechanism for epigenetic variation in many eukaryotic species. The genome-wide distribution of DNA methylation was assessed in two different inbred genotypes of maize to identify differentially methylated regions that may contribute to epigenetic variation. There are hundreds of genomic regions that have differences in DNA methylation levels in these two different genotypes, including methylation differences in regions without genetic variation. By studying the inheritance of the differential methylation in near-isogenic progeny of the two inbred lines, it is possible to demonstrate relatively stable inheritance of epigenetic variation, even in the absence of DNA sequence changes. The epigenetic variation among individuals of the same species may provide important contributions to phenotypic variation within a species even in the absence of genetic differences.
It is difficult to accurately interpret chromosomal correspondences such as true orthology and paralogy due to significant divergence of genomes from a common ancestor. Analyses are particularly problematic among lineages that have repeatedly experienced whole genome duplication (WGD) events. To compare multiple "subgenomes" derived from genome duplications, we need to relax the traditional requirements of "one-to-one" syntenic matchings of genomic regions in order to reflect "one-to-many" or more generally "many-to-many" matchings. However, this relaxation may result in the identification of synteny blocks that are derived from ancient shared WGDs that are not of interest. For many downstream analyses, we need to eliminate weak, low scoring alignments from pairwise genome comparisons. Our goal is to objectively select a subset of synteny blocks whose total scores are maximized while respecting the duplication history of the genomes in comparison. We call this "quota-based" screening of synteny blocks in order to appropriately fill a quota of syntenic relationships within one genome or between two genomes having WGD events.
We have formulated the synteny block screening as an optimization problem known as "Binary Integer Programming" (BIP), which is solved using existing linear programming solvers. The computer program QUOTA-ALIGN performs this task by creating a clear objective function that maximizes the compatible set of synteny blocks under given constraints on overlaps and depths (corresponding to the duplication history in respective genomes). Such a procedure is useful for any pairwise synteny alignments, but is most useful in lineages affected by multiple WGDs, like plants or fish lineages. For example, there should be a 1:2 ploidy relationship between genome A and B if genome B had an independent WGD subsequent to the divergence of the two genomes. We show through simulations and real examples using plant genomes in the rosid superorder that the quota-based screening can eliminate ambiguous synteny blocks and focus on specific genomic evolutionary events, like the divergence of lineages (in cross-species comparisons) and the most recent WGD (in self comparisons).
The QUOTA-ALIGN algorithm screens a set of synteny blocks to retain only those compatible with a user specified ploidy relationship between two genomes. These blocks, in turn, may be used for additional downstream analyses such as identifying true orthologous regions in interspecific comparisons. There are two major contributions of QUOTA-ALIGN: 1) reducing the block screening task to a BIP problem, which is novel; 2) providing an efficient software pipeline starting from all-against-all BLAST to the screened synteny blocks with dot plot visualizations. Python code and full documentation are publicly available at http://github.com/tanghaibao/quota-alignment. The QUOTA-ALIGN program is also integrated as a major component in SynMap (http://genomevolution.com/CoGe/SynMap.pl), offering easier access to thousands of genomes for non-programmers.
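The quota-based screening described above can be illustrated with a minimal sketch. It is not the QUOTA-ALIGN implementation: it enumerates every binary assignment by brute force instead of calling a linear programming solver, so it is only practical for a handful of blocks. The block tuple layout, the function names, and the parameters `max_depth_a` and `max_depth_b` are assumptions made for illustration. The quota is read here as a cap on how many selected blocks may cover any one position on each genome, so a 1:2 relationship in which genome B had the extra WGD becomes a depth of 2 on the A axis and 1 on the B axis.

```python
from itertools import product


def max_depth(intervals):
    """Largest number of closed intervals covering any single position."""
    depth = 0
    for start, _ in intervals:
        # The maximum coverage is always reached at some interval start.
        covering = sum(1 for s, e in intervals if s <= start <= e)
        depth = max(depth, covering)
    return depth


def quota_screen(blocks, max_depth_a, max_depth_b):
    """Pick the highest-scoring compatible subset of synteny blocks.

    blocks: list of (score, (a_start, a_end), (b_start, b_end)).
    Each block gets one binary variable (keep = 1, drop = 0); a subset is
    feasible if coverage depth on each genome stays within its quota.
    Returns (total_score, sorted indices of the kept blocks).
    """
    best_score, best_choice = 0, ()
    for choice in product((0, 1), repeat=len(blocks)):
        picked = [b for b, keep in zip(blocks, choice) if keep]
        if max_depth([b[1] for b in picked]) > max_depth_a:
            continue
        if max_depth([b[2] for b in picked]) > max_depth_b:
            continue
        score = sum(b[0] for b in picked)
        if score > best_score:
            best_score, best_choice = score, choice
    return best_score, [i for i, keep in enumerate(best_choice) if keep]


# Hypothetical toy data: blocks 0 and 1 are the two B copies of one A region,
# block 2 is a weak block over the same A region, block 3 lies elsewhere.
blocks = [
    (100, (0, 100), (0, 100)),
    (90, (0, 100), (200, 300)),
    (30, (50, 150), (400, 500)),
    (80, (200, 300), (600, 700)),
]
print(quota_screen(blocks, max_depth_a=2, max_depth_b=1))
print(quota_screen(blocks, max_depth_a=1, max_depth_b=1))
```

With a depth of 2 on the A axis the weak third block is dropped because it would push coverage of the shared A region to three; with a one-to-one quota only the best-scoring block over that region survives.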
Not all genes are created equal. Despite being supported by sequence conservation and expression data, knockout homozygotes of many genes show no visible effects, at least under laboratory conditions. We have identified a set of maize (Zea mays L.) genes which have been the subject of a disproportionate share of publications recorded at MaizeGDB. We manually anchored these “classical” maize genes to gene models in the B73 reference genome, and identified syntenic orthologs in other grass genomes. In addition to proofing the most recent version 2 maize gene models, we show that a subset of these genes, those that were identified by morphological phenotype prior to cloning, are retained at syntenic locations throughout the grasses at much higher levels than the average expressed maize gene, and are preferentially found on the maize1 subgenome even when a duplicate copy is still retained on the opposite subgenome. Maize1 is the subgenome that experienced less gene loss following the whole genome duplication in the maize lineage 5–12 million years ago, and genes located on this subgenome tend to be expressed at higher levels in modern maize. Links to the web based software that supported our syntenic analyses in the grasses should empower further research and support teaching involving the history of maize genetic research. Our findings exemplify the concept of “grasses as a single genetic system,” where what is learned in one grass may be applied to another.
We here develop computational methods to facilitate use of 454 whole genome shotgun sequencing to identify mutations in Escherichia coli K12. We had Roche sequence eight related strains derived as spontaneous mutants in a background without a whole genome sequence. They provided difference tables based on assembling each genome to reference strain E. coli MG1655 (NC_000913). Due to the evolutionary distance to MG1655, these contained a large number of both false negatives and positives. By manual analysis of the dataset, we detected all the known mutations (24 at nine locations) and identified and genetically confirmed new mutations necessary and sufficient for the phenotypes we had selected in four strains. We then had Roche assemble contigs de novo, which we further assembled to full-length pseudomolecules based on synteny with MG1655. This hybrid method facilitated detection of insertion mutations and allowed annotation from MG1655. After removing one genome with less than the optimal 20- to 30-fold sequence coverage, we identified 544 putative polymorphisms that included all of the known and selected mutations apart from insertions. Finally, we detected seven new mutations in a total of only 41 candidates by comparing single genomes to composite data for the remaining six and using a ranking system to penalize homopolymer sequencing and misassembly errors. An additional benefit of the analysis is a table of differences between MG1655 and a physiologically robust E. coli wild-type strain NCM3722. Both projects were greatly facilitated by use of comparative genomics tools in the CoGe software package (http://genomevolution.org/).
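The idea of ranking candidates so that likely homopolymer errors fall to the bottom can be sketched as follows. The abstract does not give the actual ranking system, so the scoring rule used here (supporting reads divided by the length of the homopolymer run at the site) and the names `homopolymer_run` and `rank_candidates` are purely hypothetical stand-ins.

```python
def homopolymer_run(seq, pos):
    """Length of the run of identical bases that contains seq[pos]."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def rank_candidates(seq, candidates):
    """Order candidate positions from most to least credible.

    candidates: list of (position, supporting_reads).
    Hypothetical score: supporting_reads / homopolymer run length, so a
    well-supported call inside a long run still ranks below a clean site.
    """
    scored = [(reads / homopolymer_run(seq, pos), pos)
              for pos, reads in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pos for _, pos in scored]


# Hypothetical reference fragment with a six-base T run at positions 3-8.
reference = "ACGTTTTTTGCA"
calls = [(5, 24), (1, 20), (10, 12)]
print(rank_candidates(reference, calls))
```

Here the call at position 5 has the most supporting reads but sits inside the T run, so it is ranked last.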
Whole genome duplications, or tetraploidies, are an important source of increased gene content. Following whole genome duplication, duplicate copies of many genes are lost from the genome. This loss of genes is biased both in the classes of genes deleted and the subgenome from which they are lost. Many or all classes of genes preferentially retained as duplicate copies are engaged in dose-sensitive protein–protein interactions, such that deletion of any one duplicate upsets the status quo of subunit concentrations, and presumably lowers fitness as a result. Transcription factors are also preferentially retained following every whole genome duplication studied. This has been explained as a consequence of protein–protein interactions, just as for other highly retained classes of genes. We show that the quantity of conserved noncoding sequences (CNSs) associated with genes predicts the likelihood of their retention as duplicate pairs following whole genome duplication. As many CNSs likely represent binding sites for transcriptional regulators, we propose that the likelihood of gene retention following tetraploidy may also be influenced by dose–sensitive protein–DNA interactions between the regulatory regions of CNS-rich genes – nicknamed bigfoot genes – and the proteins that bind to them. Using grass genomes, we show that differential loss of CNSs from one member of a pair following the pre-grass tetraploidy reduces its chance of retention in the subsequent maize lineage tetraploidy.
conserved non-coding sequence; polyploidy; fractionation; gene dosage; gene regulation
Following genome duplication and selfish DNA expansion, maize used a heretofore unknown mechanism to shed redundant genes and functionless DNA with bias toward one of the parental genomes.
Previous work in Arabidopsis showed that after an ancient tetraploidy event, genes were preferentially removed from one of the two homeologs, a process known as fractionation. The mechanism of fractionation is unknown. We sought to determine whether such preferential, or biased, fractionation exists in maize and, if so, whether a specific mechanism could be implicated in this process. We studied the process of fractionation using two recently sequenced grass species: sorghum and maize. The maize lineage has experienced a tetraploidy since its divergence from sorghum approximately 12 million years ago, and fragments of many knocked-out genes retain enough sequence similarity to be easily identifiable. Using sorghum exons as the query sequence, we studied the fate of both orthologous genes in maize following the maize tetraploidy. We show that genes are predominantly lost, not relocated, and that single-gene loss by deletion is the rule. Based on comparisons with orthologous sorghum and rice genes, we also infer that the sequences present before the deletion events were flanked by short direct repeats, a signature of intra-chromosomal recombination. Evidence of this deletion mechanism is found 2.3 times more frequently on one of the maize homeologs, consistent with earlier observations of biased fractionation. The over-fractionated homeolog is also a greater than 3-fold better target for transposon removal, but does not have an observably higher synonymous base substitution rate, nor could we find differentially placed methylation domains. We conclude that fractionation is indeed biased in maize and that intra-chromosomal or possibly a similar illegitimate recombination is the primary mechanism by which fractionation occurs. The mechanism of intra-chromosomal recombination explains the observed bias in both gene and transposon loss in the maize lineage. The existence of fractionation bias demonstrates that the frequency of deletion is modulated. 
Among the evolutionary benefits of this deletion/fractionation mechanism is bulk DNA removal and the generation of novel combinations of regulatory sequences and coding regions.
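The signature of flanking short direct repeats can be made concrete with a small sketch. Assume an undeleted reference sequence (for example an ortholog standing in for the pre-deletion state) and the coordinates of the interval that is missing from the fractionated copy. Recombination between two direct repeats removes one repeat copy plus the spacer, so in the undeleted sequence the bases at the start of the deleted interval should match the bases immediately after it. The function name and the coordinate convention (half-open `seq[start:end]`) are assumptions for illustration, not the authors' pipeline.

```python
def flanking_direct_repeat(seq, start, end):
    """Direct repeat shared by the start of seq[start:end] and the
    sequence immediately downstream of it; '' if there is none.

    Deletion boundaries inside a repeat are ambiguous, so the match is
    extended in both directions from the reported boundaries.
    """
    length = end - start
    # Extend rightwards: deleted bases versus bases after the deletion.
    right = 0
    while (right < length and end + right < len(seq)
           and seq[start + right] == seq[end + right]):
        right += 1
    # Extend leftwards: bases before the deletion versus its last bases.
    left = 0
    while (left + right < length and start - left - 1 >= 0
           and seq[start - left - 1] == seq[end - left - 1]):
        left += 1
    return seq[start - left:start + right]


# Hypothetical reference: two copies of ACGTA separated by a spacer.
ref = "GGCT" + "ACGTA" + "TTTTCCCCGG" + "ACGTA" + "CAGG"
print(flanking_direct_repeat(ref, 4, 19))   # boundaries at the repeat edges
print(flanking_direct_repeat(ref, 6, 21))   # same deletion, shifted call
print(flanking_direct_repeat(ref, 10, 15))  # interval with no repeat
```

Both boundary calls for the same deletion recover the same five-base repeat, while the control interval inside the spacer returns an empty string.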
All genomes can accumulate dispensable DNA in the form of duplications of individual genes or even partial or whole genome duplications. Genomes also can accumulate selfish DNA elements. Duplication events specifically are often followed by extensive gene loss. The maize genome is particularly extreme, having become tetraploid 10 million years ago and played host to massive transposon amplifications. We compared the genome of sorghum (which is homologous to the pre-tetraploid maize genome) with the two identifiable parental genomes retained in maize. The two maize genomes differ greatly: one of the parental genomes has lost 2.3 times more genes than the other, and the selfish DNA regions between genes were even more frequently lost, suggesting maize can distinguish between the parental genomes present in the original tetraploid. We show that genes are actually lost, not simply relocated. Deletions were rarely longer than a single gene, and occurred between repeated DNA sequences, suggesting mis-recombination as a mechanism of gene removal. We hypothesize an epigenetic mechanism of genome distinction to account for the selective loss. To the extent that the rate of base substitutions tracks time, we neither support nor refute claims of maize allotetraploidy. Finally, we explain why it makes sense that purifying selection in mammals does not operate at all like the gene and genome deletion program we describe here.
Local gene duplication is a prominent mechanism of gene copy number expansion. Elucidating the mechanisms by which local duplicates arise is necessary for understanding the evolution of genomes and their host organisms. Chromosome one of Arabidopsis thaliana contains an 81-gene array subdivided into 27 triplet units (t-units), with each t-unit containing three pre-transfer RNA genes. We utilized phylogenetic tree reconstructions and comparative genomics to order the events leading to the array’s formation, and propose a model using unequal crossing-over as the primary mechanism of array formation. The model is supported by additional phylogenetic information from intergenic spacer sequences separating each t-unit, comparative analysis to an orthologous array of 12 t-units in the sister taxon Arabidopsis lyrata, and additional modeling using a stochastic simulation of orthologous array divergence. Lastly, comparative phylogenetic analysis demonstrates that the two orthologous t-unit arrays undergo concerted evolution within each taxon and are likely fluctuating in copy number under neutral evolutionary drift. These findings hold larger implications for future research concerning gene and genome evolution.
Electronic supplementary material
The online version of this article (doi:10.1007/s00239-010-9350-2) contains supplementary material, which is available to authorized users.
Comparative genomics; Concerted evolution; Phylogenetics; Tandem duplication; Local duplication; Gene array; tRNA; Arabidopsis; Synteny; Copy number variation
Much of the eukaryotic genome is known to be mobile, largely due to the movement of transposons and other parasitic elements. Recent work in plants and Drosophila suggests that mobility is also a feature of many nontransposon genes and gene families. Indeed, analysis of the Arabidopsis genome suggested that as many as half of all genes had moved to unlinked positions since Arabidopsis diverged from papaya roughly 72 million years ago, and that these mobile genes tend to fall into distinct gene families. However, the mechanism by which single gene transposition occurred was not deduced. By comparing two closely related species, Arabidopsis thaliana and Arabidopsis lyrata, we sought to determine the nature of gene transposition in Arabidopsis. We found that certain categories of genes are much more likely to have transposed than others, and that many of these transposed genes are flanked by direct repeat sequence that was homologous to sequence within the orthologous target site in A. lyrata and which was predominantly genic in identity. We suggest that intrachromosomal recombination between tandemly duplicated sequences, and subsequent insertion of the circular product, is the predominant mechanism of gene transposition.
Repetitive DNA, such as satellite repeats and transposons, is ubiquitous throughout the genome. Such repeats have been associated with DNA loss, circle formation, and gene transposition in plants and Drosophila. In this work we suggest that, in plants, one mechanism of gene mobility is intrachromosomal recombination via tandem repeats. In addition, we have demonstrated that the classes of genes that tend to form tandem duplications are more likely to have transposed than other gene classes. We conclude that tandem duplications may particularly facilitate gene excision and may also provide targets for gene insertion.
In animals and yeast, position effects have been well documented. In animals, the best example of this process is Position Effect Variegation (PEV) in Drosophila melanogaster. In PEV, when genes are moved into close proximity to constitutive heterochromatin, their expression can become unstable, resulting in variegated patches of gene expression. This process is regulated by a variety of proteins implicated in both chromatin remodeling and RNAi-based silencing. A similar phenomenon is observed when transgenes are inserted into heterochromatic regions in fission yeast. In contrast, there are few examples of position effects in plants, and there are no documented examples in either plants or animals for positions that are associated with the reversal of previously established silenced states. MuDR transposons in maize can be heritably silenced by a naturally occurring rearranged version of MuDR. This element, Muk, produces a long hairpin RNA molecule that can trigger DNA methylation and heritable silencing of one or many MuDR elements. In most cases, MuDR elements remain inactive even after Muk segregates away. Thus, Muk-induced silencing involves a directed and heritable change in gene activity in the absence of changes in DNA sequence. Using classical genetic analysis, we have identified an exceptional position at which MuDR element silencing is unstable. Muk effectively silences the MuDR element at this position. However, after Muk is segregated away, element activity is restored. This restoration is accompanied by a reversal of DNA methylation. To our knowledge, this is the first documented example of a position effect that is associated with the reversal of epigenetic silencing. This observation suggests that there are cis-acting sequences that alter the propensity of an epigenetically silenced gene to remain inactive. 
This raises the interesting possibility that an important feature of local chromatin environments may be the capacity to erase previously established epigenetic marks.
Epigenetics involves the heritable alteration of gene activity without changes in DNA sequence. Although clearly a repository for heritable information, what makes epigenetic states distinct is that they are far more labile than those associated with DNA sequence. The epigenetic landscape of eukaryotic genomes is far from uniform. Vast stretches of them are effectively epigenetically silenced, while other regions are largely active. The experiments described here suggest that the propensity to maintain heritable epigenetic states can vary depending on position within the genome. Because transposable elements, or transposons, move from place to place within the genome, they make an ideal probe for differences in epigenetic states at various positions. Our model system uses a single transposon, MuDR in maize, and a variant of MuDR, Mu killer (Muk). When MuDR and Muk are combined genetically, MuDR elements become epigenetically silenced, and they generally remain so even after Muk is lost in subsequent generations. However, we have identified a particular position at which the MuDR element reactivates after Muk is lost. These data show that there are some parts of the maize genome that are either competent to erase epigenetic silencing or are incapable of maintaining it. These results suggest that erasure of heritable information may be an important component of epigenetic regulation.
Paramutation and transposon silencing are two epigenetic phenomena that have intrigued and puzzled geneticists for decades. Each involves heritable changes in gene activity without changes in DNA sequence. Here we report the cloning of a gene whose activity is required for the maintenance of both silenced transposons and paramutated color genes in maize. We show that this gene, Mop1 (Mediator of paramutation1) codes for a putative RNA-dependent RNA polymerase, whose activity is required for the production of small RNAs that correspond to the MuDR transposon sequence. We also demonstrate that although Mop1 is required to maintain MuDR methylation and silencing, it is not required for the initiation of heritable silencing. In contrast, we present evidence that a reduction in the transcript level of a maize homolog of the nucleosome assembly protein 1 histone chaperone can reduce the heritability of MuDR silencing. Together, these data suggest that the establishment and maintenance of MuDR silencing have distinct requirements.
Silencing of the transposon system Mu in maize is maintained by Mop1 (necessary for the presence of MuDR small RNAs). This silencing is mediated by a histone chaperone NAP1 ortholog.
The majority of well-documented cases of horizontal transfer between higher eukaryotes involve the movement of transposable elements between animals. Surprisingly, although plant genomes often contain vast numbers of these mobile genetic elements, no evidence of horizontal transfer of a nuclear-encoded transposon between plant species has been detected to date. The most mutagenic known plant transposable element system is the Mutator system in maize. Mu-like elements (MULEs) are widespread among plants, and previous analysis has suggested that the distribution of various subgroups of MULEs is patchy, consistent with horizontal transfer. We have sequenced portions of MULE transposons from a number of species of the genus Setaria and compared them to each other and to publicly available databases. A subset of these elements is remarkably similar to a small family of MULEs in rice. A comparison of noncoding and synonymous sequences revealed that the observed similarity is not due to selection at the amino acid level. Given the amount of time separating Setaria and rice, the degree of similarity between these elements excludes the possibility of simple vertical transmission of this class of MULEs. This is the first well-documented example of horizontal transfer of any nuclear-encoded genes between higher plants.
Sequencing and analysis of MULE transposons and their surrounding genomic regions from closely related grass species and rice provides evidence of horizontal transfer in plants.
Zea mays DataBase (ZmDB) seeks to provide a comprehensive view of maize (corn) genetics by linking genomic sequence data with gene expression analysis and phenotypes of mutant plants. ZmDB originated in 1999 as the Web portal for a large project of maize gene discovery, sequencing and phenotypic analysis using a transposon tagging strategy and expressed sequence tag (EST) sequencing. Recently, ZmDB has broadened its scope to include all public maize ESTs, genome survey sequences (GSSs), and protein sequences. More than 170 000 ESTs are currently clustered into ∼20 000 contigs and about an equal number of apparent singlets. These clusters are continuously updated and annotated with respect to potential encoded protein products. More than 100 000 GSSs are similarly assembled and annotated by spliced alignment with EST and protein sequences. The ZmDB interface provides quick access to analytical tools for further sequence analysis. Every sequence record is linked to several display options and similarity search tools, including services for multiple sequence alignment, protein domain determination and spliced alignment. Furthermore, ZmDB provides web-based ordering of materials generated in the project, including ESTs, ordered collections of genomic sequences tagged with the RescueMu transposon and microarrays of amplified ESTs. ZmDB can be accessed at http://zmdb.iastate.edu/.