The TATA-box and TATA-variants are regulatory elements involved in the formation of a transcription initiation complex. Both have been conserved throughout evolution in a restricted region close to the Transcription Start Site (TSS). However, less than half of the genes in model organisms studied so far have been found to contain either one of these elements. Indeed different core-promoter elements are involved in the recruitment of the TATA-box-binding protein. Here we assessed the possibility of identifying novel functional motifs in plant genes, sharing the TATA-box topological constraints.
We developed an ab-initio approach considering the preferential location of motifs relative to the TSS. We identified motifs observed at the TATA-box expected location and conserved in both Arabidopsis thaliana and Oryza sativa promoters. We identified TC-elements within non-TA-rich promoters 30 bases upstream of the TSS. As with the TATA-box and TATA-variant sequences, it was possible to construct a unique distance graph with the TC-element sequences. The structural and functional features of TC-element-containing genes were distinct from those of TATA-box- or TATA-variant-containing genes. Arabidopsis thaliana transcriptome analysis revealed that TATA-box-containing genes were generally those showing relatively high levels of expression and that TC-element-containing genes were generally those expressed in specific conditions.
Our observations suggest that the TC-elements might constitute a class of novel regulatory elements participating in the complex modulation of gene expression in plants.
Transcription initiation, essential to gene expression regulation, involves recruitment of basal transcription factors to the core promoter elements (CPEs). The distribution of currently known CPEs across plant genomes is largely unknown. This is the first large-scale genome-wide report on the computational prediction of CPEs across eight plant genomes to help better understand the transcription initiation complex assembly. The distribution of thirteen known CPEs across four monocots (Brachypodium distachyon, Oryza sativa ssp. japonica, Sorghum bicolor, Zea mays) and four dicots (Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera, Glycine max) reveals the structural organization of the core promoter in relation to the TATA-box as well as with respect to other CPEs. The distribution of known CPE motifs with respect to the transcription start site (TSS) exhibited positional conservation within monocots and dicots with slight differences across all eight genomes. Further, a more refined subset of annotated genes based on orthologs of the model monocot (O. sativa ssp. japonica) and dicot (A. thaliana) genomes supported the positional distribution of these thirteen known CPEs. DNA free energy profiles provided evidence that the structural properties of promoter regions are distinctly different from those of the non-regulatory genome sequence. They also showed that monocot core promoters have lower DNA free energy than dicot core promoters. The comparison of monocot and dicot promoter sequences highlights both the similarities and differences in the core promoter architecture irrespective of the species-specific nucleotide bias. This study will be useful for future work related to genome annotation projects and can inspire research efforts aimed at better understanding the regulatory mechanisms of transcription.
Chloroplasts and mitochondria evolved from the endosymbionts of once free-living eubacteria, and they transferred most of their genes to the host nuclear genome during evolution. The mechanisms used by plants to coordinate the expression of such transferred genes, as well as other genes in the host nuclear genome, are still poorly understood.
In this paper, we use nuclear-encoded chloroplast (cpRPGs), as well as mitochondrial (mtRPGs) and cytoplasmic (euRPGs) ribosomal protein genes to study the coordination of gene expression between organelles and the host. Results show that the mtRPGs, but not the cpRPGs, exhibit strongly synchronized expression with euRPGs in all investigated land plants and that this phenomenon is linked to the presence of a telo-box DNA motif in the promoter regions of mtRPGs and euRPGs. This motif is also enriched in the promoter regions of genes involved in DNA replication. Sequence analysis further indicates that mtRPGs, in contrast to cpRPGs, acquired telo-box from the host nuclear genome.
Based on our results, we propose a model of plant nuclear genome evolution in which coordination of activities in mitochondria and chloroplasts with other cellular functions, including the cell cycle, might have served as a strong selection pressure for the differential acquisition of the telo-box between mtRPGs and cpRPGs. This research also highlights the significance of physiological needs in shaping transcriptional regulatory evolution.
Transcription factor binding is regulated by several interactions, primarily involving cis-element binding. These binding sites maintain specificity by means of their sequence, and other additional factors such as inter-motif distance and spacer specificity. The ACGT core sequence has been established as a functionally important cis-element which frequently regulates gene expression in synergy with other cis-elements. In this study, we used two monocotyledonous – Oryza sativa and Sorghum bicolor, and two dicotyledonous species – Arabidopsis thaliana and Glycine max to analyze the conservation of co-occurring ACGT core elements in plant promoters with respect to spacer distance between them. Using data generated from Arabidopsis thaliana and Oryza sativa, we also identified conserved regions across all spacers and possible conditions regulating gene promoters with multiple ACGT cis-elements.
Our data indicated specific predominant spacer lengths between co-occurring ACGT elements, but these lengths were not universally conserved across all species under analysis. However, the frequency distribution indicated local regions of high correlation among monocots and dicots. Sequence specificity data clearly revealed a preference for G at the first and C at the terminal position of a spacer sequence, suggesting that the G-box motif is the most prevalent for the ACGT class of promoters. Using gene expression databases, we also observed trends suggesting that co-occurring ACGT elements are responsible for gene regulation in response to exogenous stress. Conservation in patterns of ACGT (N) ACGT among orthologous genes also indicated the possibility that emergence of functional significance across species was a result of parallel evolution of these cis-elements.
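The spacer analysis described above can be illustrated with a minimal sketch. It assumes promoters are available as plain DNA strings and that only consecutive ACGT cores are paired; the function names, the `max_spacer` cutoff of 30 and the toy sequence are hypothetical and are not taken from the study.

```python
import re
from collections import Counter

CORE = "ACGT"  # the core cis-element discussed in the text


def acgt_spacers(promoter, max_spacer=30):
    """Return the spacer sequences N in ACGT(N)ACGT for consecutive cores.

    ACGT cannot overlap with itself, so a plain non-overlapping scan
    finds every occurrence. max_spacer is an illustrative cutoff.
    """
    seq = promoter.upper()
    starts = [m.start() for m in re.finditer(CORE, seq)]
    spacers = []
    for left, right in zip(starts, starts[1:]):
        spacer = seq[left + len(CORE):right]
        if len(spacer) <= max_spacer:
            spacers.append(spacer)
    return spacers


def spacer_summary(promoters, max_spacer=30):
    """Tabulate spacer lengths and (first base, last base) pairs."""
    length_counts = Counter()
    end_counts = Counter()
    for promoter in promoters:
        for spacer in acgt_spacers(promoter, max_spacer):
            length_counts[len(spacer)] += 1
            if spacer:  # zero-length spacers have no flanking bases
                end_counts[(spacer[0], spacer[-1])] += 1
    return length_counts, end_counts


if __name__ == "__main__":
    # Toy promoter: ACGT, spacer GAAC, ACGT, spacer AA, ACGT
    lengths, ends = spacer_summary(["TTACGTGAACACGTAAACGT"])
    print(dict(lengths), dict(ends))
```

A spacer beginning with G and ending with C, as in the toy example, produces the flanking context ACGTG...CACGT, which is why such a preference points toward G-box-like motifs.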
Although the importance of ACGT elements has been acknowledged for several plant species, ours is the first study that attempts to compare their occurrence across four species and analyze conservation among them. The apparent preference for particular spacer distances suggests that these motifs might be implicated in important physiological functions which are yet to be identified. Variations in correlation patterns among monocots and dicots might arise from differences in transcriptional regulation in the two classes. In accordance with the literature, we established the involvement of co-occurring ACGT elements in stress responses and showed how this regulation differs with variation in the ACGT (N) ACGT motif. We believe that our study will be an essential resource in determining optimum spacer length and spacer sequence between ACGT elements for promoter design in the future.
Gene expression; Promoter regulation; Cis-element; Stress; Spacer; Arabidopsis; Rice; Soybean; Sorghum; Inter-motif distance
RNA helicases are enzymes that are thought to unwind double-stranded RNA molecules in an energy-dependent fashion through the hydrolysis of NTP. RNA helicases are associated with all processes involving RNA molecules, including nuclear transcription, editing, splicing, ribosome biogenesis, RNA export, and organelle gene expression. The involvement of RNA helicase in response to stress and in plant growth and development has been reported previously. While their importance in Arabidopsis and Oryza sativa has been partially studied, the function of RNA helicase proteins is poorly understood in Zea mays and Glycine max. In this study, we identified a total of RNA helicase genes in Arabidopsis and other crop species genome by genome-wide comparative in silico analysis. We classified the RNA helicase genes into three subfamilies according to the structural features of the motif II region, such as DEAD-box, DEAH-box and DExD/H-box, and different species showed different patterns of alternative splicing. Secondly, chromosome location analysis showed that the RNA helicase protein genes were distributed across all chromosomes with different densities in the four species. Thirdly, phylogenetic tree analyses identified the relevant homologs of DEAD-box, DEAH-box and DExD/H-box RNA helicase proteins in each of the four species. Fourthly, microarray expression data showed that many of these predicted RNA helicase genes were expressed in different developmental stages and different tissues under normal growth conditions. Finally, real-time quantitative PCR analysis showed that the expression levels of 10 genes in Arabidopsis and 13 genes in Zea mays were in close agreement with the microarray expression data. To our knowledge, this is the first report of a comparative genome-wide analysis of the RNA helicase gene family in Arabidopsis, Oryza sativa, Zea mays and Glycine max. 
This study provides valuable information for understanding the classification and putative functions of the RNA helicase gene family in crop growth and development.
Genome-wide identification and phylogenetic and syntenic comparison were performed for the genes responsible for phenylalanine ammonia lyase (PAL) and peroxidase A (POX A) enzymes in nine plant species representing very diverse groups such as legumes (Glycine max and Medicago truncatula), fruits (Vitis vinifera), cereals (Sorghum bicolor, Zea mays, and Oryza sativa), trees (Populus trichocarpa), and model dicot (Arabidopsis thaliana) and monocot (Brachypodium distachyon) species. A total of 87 and 1045 genes in the PAL and POX A gene families, respectively, have been identified in these species. The phylogenetic and syntenic comparison along with motif distributions shows a high degree of conservation of PAL genes, suggesting that these genes may predate monocot/eudicot divergence. The POX A family genes, present in clusters at the subtelomeric regions of chromosomes, might be evolving and expanding at a higher rate than the PAL gene family. Our analysis showed that during the expansion of the POX A gene family, many groups and subgroups have evolved, resulting in a high level of functional divergence among monocots and dicots. These results will act as a first step toward the understanding of monocot/eudicot evolution and functional characterization of these gene families in the future.
Plant promoter architecture is important for understanding the regulation and evolution of promoters, but our current knowledge about plant promoter structure, especially with respect to the core promoter, is insufficient. Several promoter elements, including the TATA box and several types of transcriptional regulatory elements, have been found to show local distribution within promoters, and this feature has been successfully utilized for extraction of promoter constituents from the human genome.
LDSS (Local Distribution of Short Sequences) profiles of short sequences along the plant promoter have been analyzed in silico, and hundreds of hexamer and octamer sequences have been identified as having localized distributions within promoters of Arabidopsis thaliana and rice. Based on their localization patterns, the identified sequences could be classified into three groups, pyrimidine patch (Y Patch), TATA box, and REG (Regulatory Element Group). Sequences of the TATA box group are consistent with the ones reported in previous studies. The REG group includes more than 200 sequences, and half of them correspond to known cis-elements. The other REG subgroups, together with about a hundred uncategorized sequences, are suggested to be novel cis-regulatory elements. Comparison of LDSS-positive sequences between Arabidopsis and rice has revealed moderate conservation of elements and common promoter architecture. In addition, a dimer motif named the YR Rule (C/T A/G) has been identified at the transcription start site (-1/+1). This rule also fits both Arabidopsis and rice promoters.
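The idea of a localized distribution can be sketched as a positional count of one short sequence across promoters aligned on the TSS, together with a check of the YR Rule at -1/+1. This is only an illustration and not the authors' LDSS scoring procedure; the function names, the `tss_index` convention (the index of the +1 base in each string) and the toy sequences are assumptions.

```python
from collections import Counter


def positional_profile(promoters, word, tss_index):
    """Count where `word` starts, relative to the TSS, across promoters.

    Every promoter is assumed to be aligned so that seq[tss_index] is
    position +1. Coordinates skip zero: ..., -2, -1, +1, +2, ...
    """
    word = word.upper()
    profile = Counter()
    for seq in promoters:
        seq = seq.upper()
        start = seq.find(word)
        while start != -1:
            offset = start - tss_index
            relative = offset + 1 if offset >= 0 else offset
            profile[relative] += 1
            start = seq.find(word, start + 1)  # allow overlapping hits
    return profile


def fits_yr_rule(seq, tss_index):
    """True if -1 is a pyrimidine (C/T) and +1 is a purine (A/G)."""
    if tss_index < 1 or tss_index >= len(seq):
        return False
    return seq[tss_index - 1].upper() in "CT" and seq[tss_index].upper() in "AG"


if __name__ == "__main__":
    toy = ["GGTATAAAGGCA", "CCTATAAACCTG"]  # hypothetical promoters
    print(positional_profile(toy, "TATAAA", tss_index=11))
    print([fits_yr_rule(s, 11) for s in toy])
```

A sequence whose counts pile up in a narrow window of relative positions, rather than spreading evenly, is the kind of candidate that a localization-based screen would flag.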
LDSS was successfully applied to plant genomes, and hundreds of putative promoter elements have been extracted as LDSS-positive octamers. The identified promoter architectures of monocots and dicots are well conserved, but there is moderate variation in the sequences utilized.
Completely sequenced plant genomes provide scope for designing a large number of microsatellite markers, which are useful in various aspects of crop breeding and genetic analysis. With the objective of developing genic but non-coding microsatellite (GNMS) markers for the rice (Oryza sativa L.) genome, we characterized the frequency and relative distribution of microsatellite repeat-motifs in 18,935 predicted protein coding genes including 14,308 putative promoter sequences.
We identified 19,555 perfect GNMS repeats with densities ranging from 306.7/Mb in chromosome 1 to 450/Mb in chromosome 12 with an average of 357.5 GNMS per Mb. The average microsatellite density was maximum in the 5' untranslated regions (UTRs) followed by those in introns, promoters, 3'UTRs and minimum in the coding sequences (CDS). Primers were designed for 17,966 (92%) GNMS repeats, including 4,288 (94%) hypervariable class I types, which were bin-mapped on the rice genome. The GNMS markers were most polymorphic in the intronic region (73.3%) followed by markers in the promoter region (53.3%) and least in the CDS (26.6%). The robust polymerase chain reaction (PCR) amplification efficiency and high polymorphic potential of GNMS markers over genic coding and random genomic microsatellite markers suggest their immediate use in efficient genotyping applications in rice. A set of these markers could assess genetic diversity and establish phylogenetic relationships among domesticated rice cultivar groups. We also demonstrated the usefulness of orthologous and paralogous conserved non-coding microsatellite (CNMS) markers, identified in the putative rice promoter sequences, for comparative physical mapping and understanding of evolutionary and gene regulatory complexities among rice and other members of the grass family. The divergence between long-grained aromatics and subspecies japonica was estimated to be more recent (0.004 Mya) compared to short-grained aromatics from japonica (0.006 Mya) and long-grained aromatics from subspecies indica (0.014 Mya).
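A minimal sketch of how perfect microsatellite repeats and a per-Mb density could be computed is given below. The minimum repeat numbers per motif length are hypothetical thresholds chosen for illustration, not the criteria used in the study, and the function names are likewise assumptions.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself built from a shorter repeated unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_perfect_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    min_repeats maps motif length to the minimum number of copies;
    the default values are illustrative assumptions only.
    """
    min_repeats = min_repeats or {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
    seq = seq.upper()
    hits = []
    for k, reps in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, reps - 1))
        for m in pattern.finditer(seq):
            # Skip e.g. 'AGAG' repeats, which are really 'AG' repeats.
            if is_primitive(m.group(1)):
                hits.append((m.start(), m.group(1), len(m.group(0)) // k))
    return sorted(hits)


def density_per_mb(n_repeats, total_bp):
    """Number of repeats per megabase of analysed sequence."""
    return n_repeats * 1_000_000 / total_bp


if __name__ == "__main__":
    toy = "TTT" + "AG" * 7 + "CCC" + "AAG" * 5 + "TT"
    print(find_perfect_ssrs(toy))
    print(density_per_mb(2, 4000))
```

Running the same search separately on promoter, UTR, intron and CDS sequences and dividing by the length of each class would give region-wise densities of the kind compared in the text.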
Our analyses showed that GNMS markers, with their high polymorphic potential, would be preferred candidate functional markers in various marker-based applications in rice genetics, genomics and breeding. The CNMS markers provided encouraging implications for their use in comparative genome mapping and understanding of evolutionary complexities in rice and other members of the grass family.
The protein phosphatase 2Cs (PP2Cs) from various organisms have been implicated as negative modulators of protein kinase pathways involved in diverse environmental stress responses and developmental processes. A genome-wide overview of the PP2C gene family in plants is not yet available.
A comprehensive computational analysis identified 80 and 78 PP2C genes in Arabidopsis thaliana (AtPP2Cs) and Oryza sativa (OsPP2Cs), respectively, which marks the PP2C gene family as one of the largest families identified in plants. Phylogenetic analysis divided PP2Cs in Arabidopsis and rice into 13 and 11 subfamilies, respectively, which are supported by the analyses of gene structures and protein motifs. Comparative analysis between the PP2C genes in Arabidopsis and rice identified common and lineage-specific subfamilies and potential 'gene birth-and-death' events. Gene duplication analysis reveals that whole genome and chromosomal segment duplications mainly contributed to the expansion of both OsPP2Cs and AtPP2Cs, but tandem or local duplication occurred less frequently in Arabidopsis than in rice. Some protein motifs are widespread among the PP2C proteins, whereas some other motifs are specific to only one or two subfamilies. Expression pattern analysis suggests that 1) most PP2C genes play functional roles in multiple tissues in both species, 2) the induced expression of most genes in subfamily A by diverse stimuli indicates their primary role in stress tolerance, especially ABA response, and 3) the expression pattern of subfamily D members suggests that they may constitute positive regulators in ABA-mediated signaling pathways. The analyses of putative upstream regulatory elements by two approaches further support the functions of subfamily A in ABA signaling, and provide insights into the shared and different transcriptional regulation machineries in dicots and monocots.
This comparative genome-wide overview of the PP2C family in Arabidopsis and rice provides insights into the functions and regulatory mechanisms, as well as the evolution and divergence of the PP2C genes in dicots and monocots. Bioinformatics analyses suggest that plant PP2C proteins from different subfamilies participate in distinct signaling pathways. Our results have established a solid foundation for future studies on the functional divergence in different PP2C subfamilies.
Telomere maintenance is essential to preserve genomic stability and involves several telomere-specific proteins as well as DNA replication and repair proteins. The kinase ATR, which has a crucial function in maintaining genome integrity from yeast to human, has been shown to be involved in telomere maintenance in several eukaryotic organisms, including yeast, Arabidopsis and Drosophila. However, its role in telomere maintenance in mammals remains poorly explored. Here, we report by using telomere-fluorescence in situ hybridization (Telo-FISH) on metaphase chromosomes that ATR deficiency causes telomere instability both in primary human fibroblasts from Seckel syndrome patients and in HeLa cells. The telomere aberrations resulting from ATR deficiency (i.e. sister telomere fusions and chromatid-type telomere aberrations) are mainly generated during and/or after telomere replication, and involve both leading and lagging strand telomeres as shown by chromosome orientation-FISH (CO-FISH). Moreover, we show that ATR deficiency strongly sensitizes cells to the G-quadruplex ligand 360A, enhancing sister telomere fusions and chromatid-type telomere aberrations involving specifically the lagging strand telomeres. Altogether, these data reveal that ATR plays a critical role in telomere maintenance during and/or after telomere replication in human cells.
The genomes of three plants, Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), and soybean (Glycine max), have been sequenced, and many of their genes and promoters have been predicted. In Arabidopsis, cis-acting promoter elements involved in cold- and dehydration-responsive gene expression have been extensively analysed; however, the characteristics of such cis-acting promoter sequences in cold- and dehydration-inducible genes of rice and soybean remain to be clarified. In this study, we performed microarray analyses using the three species, and compared characteristics of identified cold- and dehydration-inducible genes. Transcription profiles of the cold- and dehydration-responsive genes were similar among these three species, showing representative upregulated (dehydrin/LEA) and downregulated (photosynthesis-related) genes. All (4^6 = 4096) hexamer sequences in the promoters of the three species were investigated, revealing the frequency of conserved sequences in cold- and dehydration-inducible promoters. A core sequence of the abscisic acid-responsive element (ABRE) was the most conserved in dehydration-inducible promoters of all three species, suggesting that transcriptional regulation for dehydration-inducible genes is similar among these three species, with the ABRE-dependent transcriptional pathway. In contrast, for cold-inducible promoters, the conserved hexamer sequences were diversified among these three species, suggesting the existence of diverse transcriptional regulatory pathways for cold-inducible genes among the species.
plant genome; cis-acting promoter elements; cold; dehydration; microarray
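The exhaustive hexamer survey described in the abstract above can be sketched as follows. The code enumerates all 4^6 = 4096 hexamers and reports, for each, the fraction of promoters that contain it in an inducible set and in a background set. The function names and toy promoters are hypothetical, and ACGTGG is used only as an illustrative ABRE-like hexamer; the statistical treatment used in the study is not reproduced here.

```python
from collections import Counter
from itertools import product

# Every possible hexamer over the DNA alphabet: 4**6 = 4096 words.
ALL_HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]


def promoters_containing(promoters):
    """Count, for each hexamer, how many promoters contain it at least once."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        present = {seq[i:i + 6] for i in range(len(seq) - 5)}
        # Ignore windows with ambiguous bases such as N.
        counts.update(h for h in present if set(h) <= set("ACGT"))
    return counts


def hexamer_fractions(induced, background):
    """Map hexamer -> (fraction in induced set, fraction in background set)."""
    in_induced = promoters_containing(induced)
    in_background = promoters_containing(background)
    return {
        h: (in_induced[h] / len(induced), in_background[h] / len(background))
        for h in ALL_HEXAMERS
    }


if __name__ == "__main__":
    induced = ["TTACGTGGCA", "GGACGTGGTT"]      # toy inducible promoters
    background = ["TTTTTTTTTT", "AAACGTGGAA"]   # toy background promoters
    fractions = hexamer_fractions(induced, background)
    print(len(ALL_HEXAMERS), fractions["ACGTGG"])
```

Comparing the ranked hexamers from each species would then show which words are shared among the inducible promoters of all three, which is the sense in which a sequence is called conserved here.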
RNA editing is the process whereby an RNA sequence is modified from the sequence of the corresponding DNA template. In the mitochondria of land plants, some cytidines are converted to uridines before translation. Despite substantial study, the molecular biological mechanism by which C-to-U RNA editing proceeds remains relatively obscure, although several experimental studies have implicated a role for cis-recognition. A highly non-random distribution of nucleotides is observed in the immediate vicinity of edited sites (within 20 nucleotides 5' and 3'), but no precise consensus motif has been identified.
Data for analysis were derived from the complete mitochondrial genomes of Arabidopsis thaliana, Brassica napus, and Oryza sativa; additionally, a combined data set of observations across all three genomes was generated. We selected datasets based on the 20 nucleotides 5' and the 20 nucleotides 3' of edited sites and an equivalently sized and appropriately constructed null-set of non-edited sites. We used tree-based statistical methods and random forests to generate models of C-to-U RNA editing based on the nucleotides surrounding the edited/non-edited sites and on the estimated folding energies of those regions. Tree-based statistical methods based on primary sequence data surrounding edited/non-edited sites and estimates of free energy of folding yield models with optimistic re-substitution-based estimates of ~0.71 accuracy, ~0.64 sensitivity, and ~0.88 specificity. Random forest analysis yielded better models and more exact performance estimates with ~0.74 accuracy, ~0.72 sensitivity, and ~0.81 specificity for the combined observations.
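Two steps of this workflow lend themselves to a short sketch: extracting the 41-nt window (20 nt on each side of a candidate C) and computing accuracy, sensitivity and specificity from a confusion matrix. The function names and the example counts are hypothetical; the tree-based and random forest models themselves are not reproduced.

```python
def flanking_window(sequence, pos, flank=20):
    """Return the window of `flank` nt 5' and 3' around index `pos`.

    With flank=20 the window is 41 nt long and the candidate base sits
    at its centre. Returns None when the site is too close to an end.
    """
    if pos - flank < 0 or pos + flank >= len(sequence):
        return None
    return sequence[pos - flank:pos + flank + 1]


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: edited sites predicted edited      fn: edited sites missed
    tn: non-edited sites predicted so      fp: non-edited sites called edited
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    toy = "A" * 20 + "C" + "G" * 20  # hypothetical sequence, C at index 20
    window = flanking_window(toy, 20)
    print(len(window), window[20])
    # Illustrative counts only, not results from the study.
    print(classification_metrics(tp=8, fp=3, tn=7, fn=2))
```

Each position of the window can then serve as one categorical predictor, with the estimated folding energy of the window as an additional numeric predictor.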
Simple models do moderately well in predicting which cytidines will be edited to uridines, and provide the first quantitative predictive models for RNA edited sites in plant mitochondria. Our analysis shows that the identity of the nucleotide -1 to the edited C and the estimated free energy of folding for a 41 nt region surrounding the edited C are the most important variables that distinguish most edited from non-edited sites. However, the results suggest that primary sequence data and simple free energy of folding calculations alone are insufficient to make highly accurate predictions.
U-snRNA genes in higher plants contain two essential promoter elements, the USE with sequence RTCCCACATCG and the TATA-like box, positioned in the -70 and -30 regions, respectively. Using an oligodeoxynucleotide containing the USE motif and oligodeoxynucleotides specific for the intragenic regions conserved in U-snRNAs, several sequences encoding U6 and U3 snRNAs were determined by polymerase chain reaction (PCR) amplification of Arabidopsis thaliana and tobacco genomic DNAs. This method provides a simple and rapid procedure for characterisation of plant U-snRNA genes and their promoters. It could also be used for the characterisation of other genes containing conserved upstream promoter elements. PCR-derived fragments were used as probes for the isolation of the U3 snRNA genes from a genomic library of Arabidopsis. Two isolated U3 genes were shown to be active when transfected into protoplasts of Nicotiana plumbaginifolia. Both U3 genes contain the USE and TATA-like upstream elements located in positions similar to those in the U6 genes of Arabidopsis. The encoded Arabidopsis U3 snRNAs can be folded into a secondary structure that is more similar to that of U3 RNAs from lower eukaryotes than to that of U3 RNAs from metazoa.
Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.
Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.
Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.
Previous studies have demonstrated that telomere lengths in somatic cells are not randomly distributed across chromosome ends. We hypothesize that these chromosome arm specific differences in telomere length (the telomere length pattern) may be actively maintained. In this study we investigate the existence and maintenance of the telomere length pattern in stem cells. To this end, we studied telomere length in primary human mesenchymal stem cells (hMSC) and their telomerase-immortalised counterpart (hMSC-telo1) during extended proliferation as well as after irradiation. Telomere lengths were measured using Quantitative Fluorescence In Situ Hybridization (Q-FISH).
A telomere length pattern was found to exist in primary hMSCs as well as in hMSC-telo1. This pattern is similar to what was previously found in lymphocytes and fibroblasts. The cells were then exposed to a high dose of ionizing radiation. Irradiation caused profound changes in chromosome specific telomere lengths, effectively destroying the telomere length pattern. Following long term culturing after irradiation, a telomere length pattern was found to re-emerge. However, the new telomere length pattern did not resemble the telomere length pattern observed before irradiation.
Our findings indicate that a telomere length pattern does exist in mesenchymal stem cells and that the pattern is not actively re-established after destruction by irradiation.
Highly polymorphic and transferable microsatellites (SSRs) are important for comparative genomics, genome analysis and phylogenetic studies. Development of novel species-specific microsatellite markers remains a costly and labor-intensive project. Therefore, interest has shifted from genomic to genic markers owing to their high inter-species transferability, as they are developed from conserved coding regions of the genome. This study concentrates on comparative analysis of genic microsatellites in nine important legume (Arachis hypogaea, Cajanus cajan, Cicer arietinum, Glycine max, Lotus japonicus, Medicago truncatula, Phaseolus vulgaris, Pisum sativum and Vigna unguiculata) and two model plant species (Oryza sativa and Arabidopsis thaliana). Screening of a total of 228090 putative unique sequences spanning 219610522 bp using a microsatellite search tool, MISA, identified 12.18% of the unigenes containing 36248 microsatellite motifs excluding mononucleotide repeats. Frequency of legume unigene-derived SSRs was one SSR in every 6.0 kb of analyzed sequences. The trinucleotide repeats were predominant in all the unigenes with the exception of C. cajan, which showed prevalence of dinucleotide repeats over trinucleotide repeats. Dinucleotide repeats along with trinucleotides accounted for more than 90% of the total microsatellites. Among dinucleotide and trinucleotide repeats, AG and AAG motifs, respectively, were the most frequent. Microsatellite positive chickpea unigenes were assigned Gene Ontology (GO) terms to identify the possible role of unigenes in various molecular and biological functions. These unigene based microsatellite markers will prove valuable for recording allelic variance across germplasm collections, gene tagging and searching for putative candidate
Microsatellites; SSRs; Unigenes; Legumes; Functional annotation
Insertion of transposable elements (TEs) into introns can lead to their activation as alternatively spliced cassette exons, an event called exonization which can enrich the complexity of transcriptomes and proteomes. Previously, we performed the first experimental assessment of TE exonization by inserting a Ds element into each intron of the rice epsps gene. Exonization of Ds in plants was biased toward providing splice donor sites from the beginning of the inserted Ds sequence. Additionally, Ds inserted in the reverse direction resulted in a continuous splice donor consensus region by offering 4 donor sites in the same intron. The current study involved genome-wide computational analysis of Ds exonization events in the dicot Arabidopsis thaliana and the monocot Oryza sativa (rice). Up to 71% of the exonized transcripts were putative targets for the nonsense-mediated decay (NMD) pathway. The insertion patterns of Ds and the polymorphic splice donor sites increased the transcripts and subsequent protein isoforms. Protein isoforms contain protein sequence due to unspliced intron-TE region and/or a shift of the reading frame. The number of interior protein isoforms would be twice that of C-terminal isoforms, on average. TE exonization provides a promising way for functional expansion of the plant proteome.
Ac/Ds transposon; exonization; alternative splicing; nonsense-mediated decay pathway
High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.
Using a computational pipeline that utilizes Pfam and novel protein domains, we characterized paralogous families in rice and compared these with paralogous families in the model dicotyledonous diploid species, Arabidopsis thaliana. Arabidopsis, which has undergone genome duplication as well, has a substantially smaller genome (~120 Mb) and gene complement compared to rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins could be classified into paralogous protein families, respectively. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function; 26% and 66% of singleton genes compared to 73% and 96% of the paralogous family genes encode a known or putative protein in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene function was observed; a total of 17 Gene Ontology categories in both rice and Arabidopsis were statistically significant in their differential distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very young genes.
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.
Regulation of chlorophyll metabolism comprises strong transcriptional control together with a range of post-translational mechanisms during chloroplast biogenesis. Recently we reported that chlorophyll biosynthesis in Arabidopsis thaliana roots is regulated by auxin/cytokinin signaling via the combination of two transcription factors, LONG-HYPOCOTYL5 (HY5) and GOLDEN2-LIKE2 (GLK2). In this study, we examined the involvement of cis-elements in the expression of chlorophyll biosynthesis genes. Searches for predicted cis-elements in key chlorophyll biosynthesis genes and their co-expressed genes revealed coexistence of the G-box motif and the CCAATC motif, which may be targeted by HY5 and GLK factors, respectively, in their promoter regions. Deletion of the G-box from the promoter of the CHLH gene encoding the H subunit of Mg-chelatase resulted in the absence of its expression in roots but not in shoots, showing a differing involvement of the G-box in CHLH expression between shoots and roots. Our data suggest that the transcription factors and cis-elements participating in chlorophyll biosynthesis change substantially during organ differentiation, which may be linked to the differentiation of plastids.
cis-element; chlorophyll biosynthesis; co-expression network; photosynthesis; transcriptional regulation
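The co-occurrence search described in the abstract above can be illustrated with a minimal sketch. It assumes the commonly cited G-box core CACGTG (the abstract does not state the exact sequence used) and takes the CCAATC motif from the text; the function names and the toy promoter are hypothetical and are not the authors' pipeline.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    hits = set()
    # A set of patterns avoids double counting palindromic motifs.
    for pattern in {motif.upper(), reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            hits.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(hits)


def has_both_motifs(promoter, gbox="CACGTG", ccaatc="CCAATC"):
    """True if the promoter carries at least one copy of each motif.

    The G-box core used here is an assumption, not a value from the abstract.
    """
    return bool(find_motif(promoter, gbox)) and bool(find_motif(promoter, ccaatc))


# Toy promoter (invented for illustration): a G-box at position 2 and the
# CCAATC motif on the reverse strand (GATTGG) at position 11.
toy_promoter = "TTCACGTGAAAGATTGGTT"
print(find_motif(toy_promoter, "CACGTG"), find_motif(toy_promoter, "CCAATC"))
```

Scanning both strands matters because cis-elements can function in either orientation; a real analysis would also restrict the search to a defined promoter window.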
Histone post-translational modifications (HPTMs) including acetylation and methylation have been recognized as playing a crucial role in epigenetic regulation of plant growth and development. Although Solanum lycopersicum is a dicot model plant as well as an important crop, systematic analysis and expression profiling of histone modifier genes (HMs) in tomato remain incomplete.
Based on recently released tomato whole-genome sequences, we identified in silico 32 histone acetyltransferases (HATs), 15 histone deacetylases (HDACs), 52 histone methyltransferases (HMTs) and 26 histone demethylases (HDMs), and compared them with their orthologs in Arabidopsis (Arabidopsis thaliana), maize (Zea mays) and rice (Oryza sativa). Comprehensive analysis of the protein domain architecture and phylogeny revealed the presence of non-canonical motifs and new domain combinations, thereby suggesting for HATs the existence of a new family in plants. Due to species-specific diversification during evolutionary history, tomato has fewer HMs than Arabidopsis. The transcription profiles of HMs within tomato organs revealed a broad functional role for some HMs and a more specific activity for others, suggesting key HM regulators in tomato development. Finally, we explored S. pennellii introgression lines (ILs) and integrated the map position of HMs, their expression profiles and the phenotype of ILs. We thereby proved that the strategy was useful to identify HM candidates involved in carotenoid biosynthesis in tomato fruits.
In this study, we reveal the structure, phylogeny and spatial expression of members belonging to the classical families of HMs in tomato. We provide a framework for gene discovery and functional investigation of HMs in other Solanaceae species.
Solanum lycopersicum; Epigenetics; Development
MicroRNAs (miRNAs) are a major class of small non-coding RNAs that act as negative regulators at the post-transcriptional level in animals and plants. In this study, all known miRNAs in four plant species (Arabidopsis thaliana, Populus trichocarpa, Oryza sativa and Sorghum bicolor) have been analyzed, using a combination of computational and comparative genomic approaches, to systematically identify and characterize the miRNAs that were derived from repetitive elements and duplication events. The study provides a complete mapping, at the genome scale, of all the miRNAs found on repetitive elements in the four test plant species. Significant differences between repetitive element-related miRNAs and non-repeat-derived miRNAs were observed for many characteristics, including their location in protein-coding and intergenic regions in genomes, their conservation in plant species, sequence length of their hairpin precursors, base composition of their hairpin precursors and the minimum free energy of their hairpin structures. Further analysis showed that a considerable number of miRNA families in the four test plant species arose from either tandem duplication events, segmental duplication events or a combination of the two. However, comparative analysis suggested that the contribution made by these two duplication events differed greatly between the perennial tree species tested and the other three annual species. The expansion of miRNA families in A. thaliana, O. sativa and S. bicolor is more likely to occur as a result of tandem duplication events than from segmental duplications. In contrast, genomic segmental duplications contributed significantly more to the expansion of miRNA families in P. trichocarpa than did tandem duplication events.
Taken together, this study has successfully characterized miRNAs derived from repetitive elements and duplication events at the genome scale and provides comprehensive knowledge and deeper insight into the origins and evolution of miRNAs in plants.
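As a rough illustration of how tandemly duplicated family members might be flagged, the sketch below groups loci of one miRNA family into clusters when neighbouring members lie on the same chromosome within a maximum gap. The abstract does not give the criteria actually used, so the function name, the `max_gap` parameter and its default value are hypothetical assumptions.

```python
def tandem_clusters(loci, max_gap=100_000):
    """Group family members into candidate tandem clusters.

    loci    : list of (chromosome, start) tuples for one miRNA family.
    max_gap : largest distance in bases between neighbouring members of a
              cluster (an illustrative assumption, not a published cutoff).
    Returns only clusters with two or more members.
    """
    clusters = []
    for chrom, start in sorted(loci):
        last = clusters[-1][-1] if clusters else None
        if last and last[0] == chrom and start - last[1] <= max_gap:
            clusters[-1].append((chrom, start))   # extend the current cluster
        else:
            clusters.append([(chrom, start)])     # open a new cluster
    return [c for c in clusters if len(c) > 1]


# Toy coordinates, invented for illustration only.
family = [("chr1", 1_000), ("chr1", 5_000), ("chr2", 2_000), ("chr1", 900_000)]
print(tandem_clusters(family, max_gap=10_000))
```

Members left outside any cluster would then be candidates for segmental duplication, which in practice requires separate evidence such as conserved flanking genes.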
Telomeres, the essential terminal regions of linear eukaryotic chromosomes, consist of G-rich DNA repeats bound by a plethora of associated proteins. While the general pathways of telomere maintenance are evolutionarily conserved, individual telomere complex components show remarkable variation between eukaryotic lineages and even within closely related species. The recent genome sequencing of the lycophyte Selaginella moellendorffii and the availability of an ever-increasing number of flowering plant genomes provide a unique opportunity to evaluate the molecular and functional evolution of telomere components from the early evolving non-seed plants to the more developmentally advanced angiosperms. Here we analyzed telomere sequence in S. moellendorffii and found it to consist of TTTAGGG repeats, typical of most plants. Telomere tracts in S. moellendorffii range from 1 to 5.5 kb, closely resembling Arabidopsis thaliana. We identified several S. moellendorffii genes encoding sequence homologs of proteins involved in telomere maintenance in other organisms, including CST complex components and the telomere-binding proteins, POT1 and the TRFL family. Notable sequence similarities and differences were uncovered among the telomere-related genes in some of the plant lineages. Taken together, the data indicate that comparative analysis of the telomere complex in early diverging land plants such as S. moellendorffii and green algae will yield important insights into the evolution of telomeres and their protein constituents.
telomere; Selaginella; POT1; TRFL1; CST complex
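A minimal sketch of how TTTAGGG tracts could be located in a sequence is given below. It is not the method used in the study; the function name is hypothetical, and the choice to require perfect, uninterrupted repeats is a simplifying assumption.

```python
import re


def longest_repeat_tract(seq, unit="TTTAGGG"):
    """Length in bases of the longest uninterrupted tandem run of `unit`.

    Both the unit and its reverse complement are checked, because the
    sequenced strand may carry the C-rich form (CCCTAAA for TTTAGGG).
    """
    seq = seq.upper()
    unit = unit.upper()
    revcomp = unit.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    best = 0
    for u in {unit, revcomp}:
        for match in re.finditer(f"(?:{re.escape(u)})+", seq):
            best = max(best, match.end() - match.start())
    return best


# Toy read, invented for illustration: three perfect repeats, a 2-base
# interruption, then one more repeat.
read = "ACGT" + "TTTAGGG" * 3 + "CC" + "TTTAGGG"
print(longest_repeat_tract(read))  # 21
```

Real telomeric reads often contain variant repeats, so a production analysis would tolerate mismatches rather than demand exact copies.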
We have designed and implemented a web-based database system, called PlantQTL-GE, to facilitate quantitative trait locus (QTL) based candidate gene identification and gene function analysis. We collected a large number of genes, gene expression information in microarray data and expressed sequence tags (ESTs) and genetic markers from multiple sources of Oryza sativa and Arabidopsis thaliana. The system integrates these diverse data sources and has a uniform web interface for easy access. It supports QTL queries specifying QTL marker intervals or genomic loci, and displays, on the rice or Arabidopsis genome, known genes, microarray data, ESTs and candidate genes and similar putative genes in the other plant. Candidate genes in QTL intervals are further annotated based on matching ESTs, microarray gene expression data and cis-elements in regulatory sequences. The system is freely available at .
Intrinsically disordered proteins, found in all living organisms, are essential for basic cellular functions and complement the function of ordered proteins. It has been shown that protein disorder is linked to the G + C content of the genome. Furthermore, recent investigations have suggested that the evolutionary dynamics of the plant nucleus adds disordered segments to open reading frames alike, and these segments are not necessarily conserved among orthologous genes.
In the present work the distribution of intrinsically disordered proteins along the chromosomes of several representative plants was analyzed. The reported results support a non-random distribution of disordered proteins along the chromosomes of Arabidopsis thaliana and Oryza sativa, two model eudicot and monocot plant species, respectively. In fact, for most chromosomes positive correlations between the frequency of disordered segments of 30+ amino acids and both recombination rates and G + C content were observed.
These analyses demonstrate that the presence of disordered segments among plant proteins is associated with the rates of genetic recombination of their encoding genes. Altogether, these findings suggest that high recombination rates, as well as chromosomal rearrangements, could induce disordered segments in proteins during evolution.
Chromosome; Evolution; Intrinsically disordered proteins; Orthologues; Plant genome; Recombination rate
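The kind of calculation behind the reported correlations can be sketched as follows. The abstract does not specify the disorder predictor or the correlation statistic, so this example assumes per-residue disorder calls are already available as booleans and uses a plain Pearson coefficient; all names and the toy numbers are hypothetical.

```python
from itertools import groupby
from math import sqrt


def count_long_disordered_segments(flags, min_len=30):
    """Count runs of consecutive True values that are at least min_len long.

    flags: per-residue disorder calls (True = disordered) for one protein.
    """
    return sum(
        1
        for is_disordered, run in groupby(flags)
        if is_disordered and sum(1 for _ in run) >= min_len
    )


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy per-window values, invented for illustration only:
# frequency of long disordered segments vs. recombination rate.
disorder_freq = [0.10, 0.15, 0.22, 0.30]
recomb_rate = [1.0, 1.4, 2.1, 2.9]
print(round(pearson_r(disorder_freq, recomb_rate), 3))
```

In an actual analysis the windows would be defined along each chromosome, and a rank-based statistic might be preferred if the relationship is not linear.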
Dof proteins are a family of plant-specific transcription factors that contain a particular class of zinc-finger DNA-binding domain. Members of this family have been found to play diverse roles in gene regulation of processes restricted to plants. The completed genome sequences of rice and Arabidopsis constitute a valuable resource for comparative genomic analyses, since they are representatives of the two major evolutionary lineages within the angiosperms. In this framework, the identification of phylogenetic relationships among Dof proteins in these species is a fundamental step to unravel the functionality of new and yet uncharacterised genes belonging to this group.
We identified 30 different Dof genes in the rice Oryza sativa genome and performed a phylogenetic analysis of a complete collection of the 36 reported Arabidopsis thaliana and the rice Dof transcription factors identified herein. This analysis led to a classification into four major clusters of orthologous genes and showed gene loss and duplication events in Arabidopsis and rice that occurred before and after the last common ancestor of the two species.
According to our analysis, the Dof gene family in angiosperms is organized in four major clusters of orthologous genes or subfamilies. The proposed clusters of orthology and their further analysis suggest the existence of monocot-specific genes and invite exploration of their functionality in relation to the distinct physiological characteristics of these evolutionary groups.