Diverse plant genome sequencing projects coupled with powerful bioinformatics tools have facilitated massive data analysis to construct specialized databases classified according to cellular function. However, there are still a considerable number of genes encoding proteins whose function has not yet been characterized. Included in this category are small proteins (SPs, 30–150 amino acids) encoded by short open reading frames (sORFs). SPs play important roles in plant physiology, growth, and development. Unfortunately, protocols focused on the genome-wide identification and characterization of sORFs are scarce or remain poorly implemented. As a result, these genes are underrepresented in many genome annotations. In this work, we exploited publicly available genome sequences of Phaseolus vulgaris, Medicago truncatula, Glycine max, and Lotus japonicus to analyze the abundance of annotated SPs in legumes. Our strategy to uncover bona fide sORFs at the genome level was centered on bioinformatic analysis of characteristics such as evidence of expression (transcription), presence of known protein regions or domains, and identification of orthologous genes in the genomes explored. We collected 6,170, 10,461, 30,521, and 23,599 putative sORFs from P. vulgaris, G. max, M. truncatula, and L. japonicus genomes, respectively. Expressed sequence tags (ESTs) available in the DFCI Gene Index database provided evidence that approximately one-third of the predicted legume sORFs are expressed. Most potential SPs have a counterpart in a different plant species and counterpart regions or domains in larger proteins. Potential functional sORFs were also classified according to a reduced set of GO categories, and the expression of 13 of them during P. vulgaris nodule ontogeny was confirmed by qPCR. This analysis provides a collection of sORFs that potentially encode meaningful SPs, and offers the possibility of their further functional evaluation.
gene annotation; legume genomes; short open reading frames
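As a rough illustration of what a genome-wide sORF scan involves, the sketch below enumerates ATG-to-stop open reading frames on both strands and keeps those encoding 30–150 amino acids, the SP size window given above. It is not the pipeline used in the study, which worked from annotated genomes and added expression, domain, and orthology evidence; the function names and the choice to start each ORF at the first ATG after the previous stop codon are assumptions made for this example.

```python
# Illustrative sORF scanner; names and design choices are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown bases become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def find_sorfs(seq, min_aa=30, max_aa=150):
    """Return (strand, start, end, aa_length) tuples for candidate sORFs.

    Coordinates are 0-based and half-open on the scanned strand, and
    `end` includes the stop codon. `aa_length` counts the start codon
    but not the stop codon.
    """
    seq = seq.upper()
    hits = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    aa_length = (i - start) // 3
                    if min_aa <= aa_length <= max_aa:
                        hits.append((strand, start, i + 3, aa_length))
                    start = None
    return hits
```

A scan like this only produces candidates. The filters described in the abstract (EST support, known domains, orthologs in other genomes) are what separate plausible sORFs from random in-frame ATG-to-stop stretches.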
Legumes play a vital role in maintaining the nitrogen cycle of the biosphere. They conduct symbiotic nitrogen fixation through endosymbiotic relationships with bacteria in root nodules. However, this and other characteristics of legumes, including mycorrhization, compound leaf development and profuse secondary metabolism, are absent in the typical model plant Arabidopsis thaliana. We present LegumeIP (http://plantgrn.noble.org/LegumeIP/), an integrative database for comparative genomics and transcriptomics of model legumes, for studying gene function and genome evolution in legumes. LegumeIP compiles gene and gene family information, syntenic and phylogenetic context and tissue-specific transcriptomic profiles. The database holds the genomic sequences of three model legumes, Medicago truncatula, Glycine max and Lotus japonicus plus two reference plant species, A. thaliana and Populus trichocarpa, with annotations based on UniProt, InterProScan, Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes databases. LegumeIP also contains large-scale microarray and RNA-Seq-based gene expression data. Our new database is capable of systematic synteny analysis across M. truncatula, G. max, L. japonicus and A. thaliana, as well as construction and phylogenetic analysis of gene families across the five hosted species. Finally, LegumeIP provides comprehensive search and visualization tools that enable flexible queries based on gene annotation, gene family, synteny and relative gene expression.
MicroRNAs (miRNA) are ∼21 nucleotide-long non-coding small RNAs, which function as post-transcriptional regulators in eukaryotes. miRNAs play essential roles in regulating plant growth and development. In recent years, research into the mechanism and consequences of miRNA action has made great progress. With whole genome sequence available in such plants as Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Glycine max, etc., it is desirable to develop a plant miRNA database through the integration of large amounts of information about publicly deposited miRNA data. The plant miRNA database (PMRD) integrates available plant miRNA data deposited in public databases, gleaned from the recent literature, and data generated in-house. This database contains sequence information, secondary structure, target genes, expression profiles and a genome browser. In total, there are 8433 miRNAs collected from 121 plant species in PMRD, including model plants and major crops such as Arabidopsis, rice, wheat, soybean, maize, sorghum, barley, etc. For Arabidopsis, rice, poplar, soybean, cotton, medicago and maize, we included the possible target genes for each miRNA with a predicted interaction site in the database. Furthermore, we provided miRNA expression profiles in the PMRD, including our local rice oxidative stress related microarray data (LC Sciences miRPlants_10.1) and the recently published microarray data for poplar, Arabidopsis, tomato, maize and rice. The PMRD database was constructed by open source technology utilizing a user-friendly web interface, and multiple search tools. The PMRD is freely available at http://bioinformatics.cau.edu.cn/PMRD. We expect PMRD to be a useful tool for scientists in the miRNA field in order to study the function of miRNAs and their target genes, especially in model plants and major crops.
The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study.
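The graph-based clustering behind the tribes can be pictured with a small pure-Python sketch of the Markov Cluster (MCL) idea: alternate expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation) until the matrix stops changing. This is a toy under stated assumptions, not the MCL program used for PlantTribes. The function names, the default inflation value, and the dense list-of-lists representation are all choices made for illustration. In MCL, clustering granularity is usually tuned through the inflation parameter, but the settings behind the low, medium and high stringencies are not given here.

```python
# Toy Markov clustering on a small similarity graph; names are hypothetical.

def normalize_columns(matrix):
    """Scale each column in place so that it sums to 1."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix


def mcl(adjacency, inflation=2.0, max_iter=50, tol=1e-9):
    """Cluster nodes of a symmetric, non-negative adjacency matrix.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Self-loops are added before normalising, a common MCL convention.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one step of flow spreading (matrix squared).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        inflated = [[value ** inflation for value in row] for row in expanded]
        normalize_columns(inflated)
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)
```

In practice the edge weights would come from all-against-all protein similarity scores, and a sparse implementation is needed at the scale of whole proteomes.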
Genome-wide identification and phylogenetic and syntenic comparison were performed for the genes responsible for phenylalanine ammonia lyase (PAL) and peroxidase A (POX A) enzymes in nine plant species representing very diverse groups such as legumes (Glycine max and Medicago truncatula), fruits (Vitis vinifera), cereals (Sorghum bicolor, Zea mays, and Oryza sativa), trees (Populus trichocarpa), and model dicot (Arabidopsis thaliana) and monocot (Brachypodium distachyon) species. A total of 87 and 1045 genes in the PAL and POX A gene families, respectively, have been identified in these species. The phylogenetic and syntenic comparison along with motif distributions shows a high degree of conservation of PAL genes, suggesting that these genes may predate monocot/eudicot divergence. The POX A family genes, present in clusters at the subtelomeric regions of chromosomes, might be evolving and expanding at a higher rate than the PAL gene family. Our analysis showed that during the expansion of the POX A gene family, many groups and subgroups have evolved, resulting in a high level of functional divergence among monocots and dicots. These results will act as a first step toward the understanding of monocot/eudicot evolution and functional characterization of these gene families in the future.
Rates of molecular evolution vary widely among species. While significant deviations from a molecular clock have been found in many taxa, effects of life histories on molecular evolution are not fully understood. In plants, annual/perennial life history traits have long been suspected to influence the evolutionary rates at the molecular level. To date, however, the number of genes investigated on this subject is limited and the conclusions are mixed. To evaluate the possible heterogeneity in evolutionary rates between annual and perennial plants at the genomic level, we investigated 85 nuclear housekeeping genes, 10 non-housekeeping families, and 34 chloroplast genes using the genomic data from model plants including Arabidopsis thaliana and Medicago truncatula for annuals and grape (Vitis vinifera) and poplar (Populus trichocarpa) for perennials.
According to the cross-comparisons among the four species, 74-82% of the nuclear genes and 71-97% of the chloroplast genes suggested higher rates of molecular evolution in the two annuals than those in the two perennials. The significant heterogeneity in evolutionary rate between annuals and perennials was consistently found both in nonsynonymous sites and synonymous sites. While a linear correlation of evolutionary rates in orthologous genes between species was observed in nonsynonymous sites, the correlation was weak or invisible in synonymous sites. This tendency was clearer in nuclear genes than in chloroplast genes, in which the overall evolutionary rate was small. The slope of the regression line was consistently lower than unity, further confirming the higher evolutionary rate in annuals at the genomic level.
The higher evolutionary rate in annuals than in perennials appears to be a universal phenomenon both in nuclear and chloroplast genomes in the four dicot model plants we investigated. Therefore, such heterogeneity in evolutionary rate should result from factors that have genome-wide influence, most likely those associated with annual/perennial life history. Although we acknowledge current limitations of this kind of study, mainly due to a small sample size available and a distant taxonomic relationship of the model organisms, our results indicate that the genome-wide survey is a promising approach toward further understanding of the mechanism determining the molecular evolutionary rate at the genomic level.
NPR1 is a gene of central importance in enabling plants to resist microbial attack. Therefore, knowledge of nearby genes is important for genome analysis and possibly for improving disease resistance. In this study, systematic DNA sequence analysis, gene annotation, and protein BLASTs were performed to determine genes near the NPR1 gene in Beta vulgaris L., Medicago truncatula Gaertn, and Populus trichocarpa Torr. & Gray, and to assess predicted function. Microsynteny was discovered for NPR1 with genes CaMP, encoding a chloroplast-targeted signal calmodulin-binding protein, and CK1PK, a CK1-class protein kinase. Conserved microsynteny of NPR1, CaMP, and CK1PK in three diverse species of eudicots suggests maintenance during evolution by positive selection for close proximity. Perhaps close physical linkage contributes to coordinated expression of these particular genes that may control critically important processes including nuclear events and signal transduction.
Detailed and comprehensive genome annotation can be considered a prerequisite for effective analysis and interpretation of omics data. As such, Gene Ontology (GO) annotation has become a well accepted framework for functional annotation. The genus Aspergillus comprises fungal species that are important model organisms, plant and human pathogens as well as industrial workhorses. However, GO annotation based on both computational predictions and extended manual curation has so far only been available for one of its species, namely A. nidulans.
Based on protein homology, we mapped 97% of the 3,498 GO annotated A. nidulans genes to at least one of seven other Aspergillus species: A. niger, A. fumigatus, A. flavus, A. clavatus, A. terreus, A. oryzae and Neosartorya fischeri. GO annotation files compatible with diverse publicly available tools have been generated and deposited online. To further improve their accessibility, we developed a web application for GO enrichment analysis named FetGOat and integrated GO annotations for all Aspergillus species with public genome sequences. Both the annotation files and the web application FetGOat are accessible via the Broad Institute's website (http://www.broadinstitute.org/fetgoat/index.html). To demonstrate the value of those new resources for functional analysis of omics data for the genus Aspergillus, we performed two case studies analyzing microarray data recently published for A. nidulans, A. niger and A. oryzae.
We mapped A. nidulans GO annotation to seven other Aspergilli. By depositing the newly mapped GO annotation online as well as integrating it into the web tool FetGOat, we provide new, valuable and easily accessible resources for omics data analysis and interpretation for the genus Aspergillus. Furthermore, we have given a general example of how a well annotated genome can help improve GO annotation of related species to subsequently facilitate the interpretation of omics data.
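GO enrichment analysis of the kind FetGOat offers is often framed as a one-sided hypergeometric test: given a genome with some number of genes annotated to a term, how surprising is the count of annotated genes in a gene list of interest? The sketch below computes that tail probability using only the standard library. Which statistical test FetGOat actually uses is not stated here, so the choice of test, the function name and the parameter names are assumptions for illustration.

```python
from math import comb


def go_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value for over-representation of a GO term.

    population: genes in the background set (e.g. all annotated genes)
    annotated:  background genes carrying the GO term
    sample:     genes in the list of interest (e.g. differentially expressed)
    hits:       genes in the list of interest carrying the GO term
    Returns P(X >= hits) when `sample` genes are drawn without replacement.
    """
    if not 0 <= hits <= min(annotated, sample):
        raise ValueError("hits must lie between 0 and min(annotated, sample)")
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, min(annotated, sample) + 1)
    )
    return tail / total
```

When many GO terms are tested at once, the resulting p-values would normally be corrected for multiple testing before interpretation.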
Plant polyphenol oxidases (PPOs) are enzymes that typically use molecular oxygen to oxidize ortho-diphenols to ortho-quinones. These commonly cause browning reactions following tissue damage, and may be important in plant defense. Some PPOs function as hydroxylases or in cross-linking reactions, but in most plants their physiological roles are not known. To better understand the importance of PPOs in the plant kingdom, we surveyed PPO gene families in 25 sequenced genomes from chlorophytes, bryophytes, lycophytes, and flowering plants. The PPO genes were then analyzed in silico for gene structure, phylogenetic relationships, and targeting signals.
Many previously uncharacterized PPO genes were uncovered. The moss, Physcomitrella patens, contained 13 PPO genes and Selaginella moellendorffii (spike moss) and Glycine max (soybean) each had 11 genes. Populus trichocarpa (poplar) contained a highly diversified gene family with 11 PPO genes, but several flowering plants had only a single PPO gene. By contrast, no PPO-like sequences were identified in several chlorophyte (green algae) genomes or Arabidopsis (A. lyrata and A. thaliana). We found that many PPOs contained one or two introns often near the 3’ terminus. Furthermore, N-terminal amino acid sequence analysis using ChloroP and TargetP 1.1 predicted that several putative PPOs are synthesized via the secretory pathway, a unique finding as most PPOs are predicted to be chloroplast proteins. Phylogenetic reconstruction of these sequences revealed that large PPO gene repertoires in some species are mostly a consequence of independent bursts of gene duplication, while the lineage leading to Arabidopsis must have lost all PPO genes.
Our survey identified PPOs in gene families of varying sizes in all land plants except in the genus Arabidopsis. While we found variation in intron numbers and positions, overall PPO gene structure is congruent with the phylogenetic relationships based on primary sequence data. The dynamic nature of this gene family differentiates PPO from other oxidative enzymes, and is consistent with a protein important for a diversity of functions relating to environmental adaptation.
Sinorhizobium meliloti is a symbiotic soil bacterium of the alphaproteobacterial subdivision. Like other rhizobia, S. meliloti induces nitrogen-fixing root nodules on leguminous plants. This is an ecologically and economically important interaction, because plants engaged in symbiosis with rhizobia can grow without exogenous nitrogen fertilizers. The S. meliloti-Medicago truncatula (barrel medic) association is an important symbiosis model. The S. meliloti genome was published in 2001, and the Medicago truncatula genome currently is being sequenced. Many new resources and data have been made available since the original S. meliloti genome annotation and an update was needed. In June 2008, we submitted our annotation update to the EMBL and NCBI databases. Here we describe this new annotation and a new web-based portal RhizoGATE. About 1000 annotation updates were made; these included assigning functions to 313 putative proteins, assigning EC numbers to 431 proteins, and identifying 86 new putative genes. RhizoGATE incorporates the new annotation with the S. meliloti GenDB project, a platform that allows annotation updates in real time. Locations of transposon insertions, plasmid integrations, and array probe sequences are available in the GenDB project. RhizoGATE employs the EMMA platform for management and analysis of transcriptome data and the IGetDB data warehouse to integrate a variety of heterogeneous external data sources.
Rhizobiales; α-proteobacteria; symbiotic nitrogen fixation; Medicago; symbiosis
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute of Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. The CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with the source; ideally a journal reference, or, if imported and lacking one, the original database source.
PlantGDB (http://www.plantgdb.org/) is a genomics database encompassing sequence data for green plants (Viridiplantae). PlantGDB provides annotated transcript assemblies for >100 plant species, with transcripts mapped to their cognate genomic context where available, integrated with a variety of sequence analysis tools and web services. For 14 plant species with emerging or complete genome sequence, PlantGDB's genome browsers (xGDB) serve as a graphical interface for viewing, evaluating and annotating transcript and protein alignments to chromosome or bacterial artificial chromosome (BAC)-based genome assemblies. Annotation is facilitated by the integrated yrGATE module for community curation of gene models. Novel web services at PlantGDB include Tracembler, an iterative alignment tool that generates contigs from GenBank trace file data and BioExtract Server, a web-based server for executing custom sequence analysis workflows. PlantGDB also hosts a plant genomics research outreach portal (PGROP) that facilitates access to a large number of resources for research and training.
F-box proteins constitute a large gene family that regulates processes from hormone signaling to stress response. F-box proteins are the substrate recognition modules of SCF E3 ubiquitin ligases. Here we report very distinct trends in family size, duplication, synteny and transcription of F-box genes in two nitrogen-fixing legumes, Glycine max (soybean) and Medicago truncatula (barrel medic). While the soybean FBX genes emerged mainly through segmental duplications (including whole-genome duplications), the M. truncatula genome is dominated by locally-duplicated (tandem) F-box genes. Many of these young FBX genes evolved complex transcriptional patterns, including preferential transcription in different tissues, suggesting that they have probably been recruited to important biochemical pathways (e.g. nodulation and seed development).
The PLAnt co-EXpression database (PLANEX) is a new internet-based database for plant gene analysis. PLANEX (http://planex.plantbioinformatics.org) contains publicly available GeneChip data obtained from the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). PLANEX is a genome-wide co-expression database, which allows for the functional identification of genes from a wide variety of experimental designs. It can be used for the characterization of genes for functional identification and analysis of a gene’s dependency among other genes. Gene co-expression databases have been developed for other species, but gene co-expression information for plants is currently limited.
We constructed PLANEX as a list of co-expressed genes and functional annotations for Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays. PLANEX reports Pearson’s correlation coefficients (PCCs; r-values) between a gene of interest and the other genes on a given microarray platform set corresponding to a particular organism. To support PCCs, PLANEX performs an enrichment test of Gene Ontology terms and computes Cohen’s Kappa values to compare functional similarity for all genes in the co-expression database. PLANEX draws a cluster network with co-expressed genes, which is estimated using the k-means method. To construct PLANEX, a variety of datasets were interpreted by the IBM supercomputer Advanced Interactive eXecutive (AIX) in a supercomputing center.
PLANEX provides a correlation database, a cluster network and an interpretation of enrichment test results for eight plant species. A typical co-expressed gene generates lists of co-expression data that contain hundreds of genes of interest for enrichment analysis. Also, co-expressed genes can be identified and cataloged in terms of comparative genomics by using the ‘Co-expression gene compare’ feature. This type of analysis will help interpret experimental data and determine whether there is a common term among genes of interest.
Co-expression; Database; Pearson’s correlation coefficients; Clustering
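The core quantity in a co-expression database such as PLANEX is the Pearson correlation coefficient between two expression profiles measured across the same arrays. The sketch below computes r for a pair of profiles and lists the genes whose correlation with a query gene meets a cutoff. The function names, the dictionary layout, and the 0.9 cutoff are assumptions for illustration and are not taken from PLANEX.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def coexpressed(profiles, query, min_r=0.9):
    """Return (gene, r) pairs with r >= min_r against the query gene.

    profiles maps gene identifiers to expression values over the same
    ordered set of arrays. Results are sorted by decreasing r.
    """
    reference = profiles[query]
    scored = [
        (gene, pearson(reference, values))
        for gene, values in profiles.items()
        if gene != query
    ]
    return sorted(
        [(gene, r) for gene, r in scored if r >= min_r],
        key=lambda pair: pair[1],
        reverse=True,
    )
```

The resulting gene list is the kind of input that a GO enrichment test or a k-means clustering step would then take.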
Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, at the levels of the whole genome and individual glycoside hydrolase families.
We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. For several glycoside hydrolase families (GH5, GH13, GH18, GH19, GH28, and GH51), we present a detailed literature review together with an examination of the family structures. This analysis of individual families revealed both similarities and distinctions between monocots and eudicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within GH families, the Brachypodium and sorghum proteins generally cluster with those from other monocots.
This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases. Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a grass model for investigations of these enzymes and their diverse roles in planta. Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses.
The comparative transcriptional analysis of highly syntenic regions in six different organ types between Medicago truncatula (barrel medic) and Glycine max (soybean), using nucleotide tiling microarrays, provides insights into genome organization and transcriptional regulation in these legume plants.
Legumes are the third largest family of flowering plants and are unique among crop species in their ability to fix atmospheric nitrogen. As a result of recent genome sequencing efforts, legumes are now one of a few plant families with extensive genomic and transcriptomic data available in multiple species. The unprecedented complexity and impending completeness of these data create opportunities for new approaches to discovery.
We report here a transcriptional analysis in six different organ types of syntenic regions totaling approximately 1 Mb between the legume plants barrel medic (Medicago truncatula) and soybean (Glycine max) using oligonucleotide tiling microarrays. This analysis detected transcription of over 80% of the predicted genes in both species. We also identified 499 and 660 transcriptionally active regions from barrel medic and soybean, respectively, over half of which locate outside of the predicted exons. We used the tiling array data to detect differential gene expression in the six examined organ types and found several genes that are preferentially expressed in the nodule. Further investigation revealed that some collinear genes exhibit different expression patterns between the two species.
These results demonstrate the utility of genome tiling microarrays in generating transcriptomic data to complement computational annotation of the newly available legume genome sequences. The tiling microarray data was further used to quantify gene expression levels in multiple organ types of two related legume species. Further development of this method should provide a new approach to comparative genomics aimed at elucidating genome organization and transcriptional regulation.
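One simple way to picture how transcriptionally active regions (TARs) are called from tiling-array data is to merge runs of neighbouring probes whose signal clears a threshold, then flag the runs that overlap no annotated exon. The sketch below does this. It is not the procedure used in the study, whose calling rules are not described here. The function names, the threshold, the minimum run length and the maximum gap are all hypothetical parameters chosen for illustration.

```python
# Toy TAR caller for tiling-array probes; parameters are hypothetical.

def call_tars(probes, threshold, min_probes=3, max_gap=50):
    """Merge consecutive above-threshold probes into candidate TARs.

    probes: iterable of (start, end, signal) with half-open coordinates.
    A run is broken by a below-threshold probe or by a gap larger than
    max_gap between neighbouring probes. Runs shorter than min_probes
    are discarded. Returns a list of (start, end) intervals.
    """
    tars = []
    run = []

    def flush():
        # Keep the current run only if it is long enough, then reset it.
        if len(run) >= min_probes:
            tars.append((run[0][0], run[-1][1]))
        run.clear()

    for start, end, signal in sorted(probes):
        above = signal >= threshold
        if not above or (run and start - run[-1][1] > max_gap):
            flush()
        if above:
            run.append((start, end))
    flush()
    return tars


def outside_exons(tars, exons):
    """Return the TARs that overlap none of the (start, end) exon intervals."""
    return [
        tar for tar in tars
        if all(tar[1] <= exon_start or tar[0] >= exon_end
               for exon_start, exon_end in exons)
    ]
```

TARs that fall outside predicted exons, as more than half did in both species, are the candidates most likely to point at missing or incomplete gene models.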
The root apical meristem of crop and model legume Medicago truncatula is a significantly different stem cell system to that of the widely studied model plant species Arabidopsis thaliana. In this study we used the Affymetrix Medicago GeneChip® to compare the transcriptomes of meristem and non-meristematic root to identify root meristem specific candidate genes.
Using mRNA from root meristem and non-meristem we were able to identify 324 and 363 transcripts differentially expressed from the two regions. With bioinformatics tools developed to functionally annotate the Medicago genome array we could identify significant changes in metabolism, signalling and the differential expression of 55 transcription factors in meristematic and non-meristematic roots.
This is the first comprehensive analysis of M. truncatula root meristem cells using this genome array. These data will facilitate the mapping of regulatory and metabolic networks involved in the open root meristem of M. truncatula and provide candidates for functional analysis.
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all.
The small heat shock proteins (sHSPs) are a diverse family of molecular chaperones. It is well established that these proteins are crucial components of the plant heat shock response. They also have important roles in other stress responses and in normal development. We have conducted a comparative sequence analysis of the sHSPs in three complete angiosperm genomes: Arabidopsis thaliana, Populus trichocarpa, and Oryza sativa. Our phylogenetic analysis has identified four additional plant sHSP subfamilies and thus has increased the number of plant sHSP subfamilies from 7 to 11. We have also identified a number of novel sHSP genes in each genome that lack close homologs in other genomes. Using publicly available gene expression data and predicted secondary structures, we have determined that the sHSPs in plants are far more diverse in sequence, expression profile, and in structure than had been previously known. Some of the newly identified subfamilies are not stress regulated, may not possess the highly conserved large oligomer structure, and may not even function as molecular chaperones. We found no consistent evolutionary patterns across the three species studied. For example, gene conversion was found among the sHSPs in O. sativa but not in A. thaliana or P. trichocarpa. Among the three species, P. trichocarpa had the most sHSPs. This was due to an expansion of the cytosolic I sHSPs that was not seen in the other two species. Our analysis indicates that the sHSPs are a dynamic protein family in angiosperms with unexpected levels of diversity.
Electronic supplementary material
The online version of this article (doi:10.1007/s12192-008-0023-7) contains supplementary material, which is available to authorized users.
The Comprehensive Phytopathogen Genomics Resource (CPGR) provides a web-based portal for plant pathologists and diagnosticians to view the genome and transcriptome sequence status of 806 bacterial, fungal, oomycete, nematode, viral and viroid plant pathogens. Tools are available to search and analyze annotated genome sequences of 74 bacterial, fungal and oomycete pathogens. Oomycete and fungal genomes are obtained directly from GenBank, whereas bacterial genome sequences are downloaded from the A Systematic Annotation Package (ASAP) database that provides curation of genomes using comparative approaches. Curated lists of bacterial genes relevant to pathogenicity and avirulence are also provided. The Plant Pathogen Transcript Assemblies Database provides annotated assemblies of the transcribed regions of 82 eukaryotic genomes from publicly available single pass Expressed Sequence Tags. Data-mining tools are provided along with tools to create candidate diagnostic markers, an emerging use for genomic sequence data in plant pathology. The Plant Pathogen Ribosomal DNA (rDNA) database is a resource for pathogens that lack genome or transcriptome data sets and contains 131 755 rDNA sequences from GenBank for 17 613 species identified as plant pathogens and related genera.
Database URL: http://cpgr.plantbiology.msu.edu.
Plant genomes contain several hundred defensin-like (DEFL) genes that encode short cysteine-rich proteins resembling defensins, which are well known antimicrobial polypeptides. Little is known about the expression patterns or functions of many DEFLs because most were discovered recently and hence are not well represented on standard microarrays. We designed a custom Affymetrix chip consisting of probe sets for 317 and 684 DEFLs from Arabidopsis thaliana and Medicago truncatula, respectively for cataloging DEFL expression in a variety of plant organs at different developmental stages and during symbiotic and pathogenic associations. The microarray analysis provided evidence for the transcription of 71% and 90% of the DEFLs identified in Arabidopsis and Medicago, respectively, including many of the recently annotated DEFL genes that previously lacked expression information. Both model plants contain a subset of DEFLs specifically expressed in seeds or fruits. A few DEFLs, including some plant defensins, were significantly up-regulated in Arabidopsis leaves inoculated with Alternaria brassicicola or Pseudomonas syringae pathogens. Among these, some were dependent on jasmonic acid signaling or were associated with specific types of immune responses. There were notable differences in DEFL gene expression patterns between Arabidopsis and Medicago, as the majority of Arabidopsis DEFLs were expressed in inflorescences, while only a few exhibited root-enhanced expression. By contrast, Medicago DEFLs were most prominently expressed in nitrogen-fixing root nodules. Thus, our data document salient differences in DEFL temporal and spatial expression between Arabidopsis and Medicago, suggesting distinct signaling routes and distinct roles for these proteins in the two plant species.
Enterobacter sp. 638 is an endophytic plant growth promoting gamma-proteobacterium that was isolated from the stem of poplar (Populus trichocarpa×deltoides cv. H11-11), a potentially important biofuel feedstock plant. The Enterobacter sp. 638 genome sequence reveals the presence of a 4,518,712 bp chromosome and a 157,749 bp plasmid (pENT638-1). Genome annotation and comparative genomics allowed the identification of an extended set of genes specific to the plant niche adaptation of this bacterium. This includes genes that code for putative proteins involved in survival in the rhizosphere (to cope with oxidative stress or uptake of nutrients released by plant roots), root adhesion (pili, adhesion, hemagglutinin, cellulose biosynthesis), colonization/establishment inside the plant (chemotaxis, flagella, cellobiose phosphorylase), plant protection against fungal and bacterial infections (siderophore production and synthesis of the antimicrobial compounds 4-hydroxybenzoate and 2-phenylethanol), and improved poplar growth and development through the production of the phytohormones indole acetic acid, acetoin, and 2,3-butanediol. Metabolite analysis, confirmed by quantitative RT–PCR, showed that the production of acetoin and 2,3-butanediol is induced by the presence of sucrose in the growth medium. Interestingly, both the genetic determinants required for sucrose metabolism and the synthesis of acetoin and 2,3-butanediol are clustered on a genomic island. These findings point to a close interaction between Enterobacter sp. 638 and its poplar host, where the availability of sucrose, a major plant sugar, affects the synthesis of plant growth promoting phytohormones by the endophytic bacterium. The availability of the genome sequence, combined with metabolome and transcriptome analysis, will provide a better understanding of the synergistic interactions between poplar and its growth promoting endophyte Enterobacter sp. 638. This information can be further exploited to improve establishment and sustainable production of poplar as an energy feedstock on marginal, non-agricultural soils using endophytic bacteria as growth promoting agents.
Poplar is considered the model tree species for the production of lignocellulosic biomass destined for biofuel production. The plant growth promoting endophytic bacterium Enterobacter sp. 638 can improve the growth of poplar on marginal soils by as much as 40%. This prompted us to sequence the genome of this strain and, via comparative genomics, identify functions essential for the successful colonization of, and endophytic association with, its poplar host. Analysis of the genome sequence, combined with metabolite analysis and quantitative PCR, pointed to a remarkable interaction between Enterobacter sp. 638 and its poplar host: the endophyte is responsible for the production of a phytohormone, and of a precursor for another, that poplar is unable to synthesize, and the production of these plant growth promoting compounds depended on the presence of plant-synthesized compounds, such as sucrose, in the growth medium. Our results provide the basis for a better understanding of the synergistic interactions between poplar and Enterobacter sp. 638. This information can be further exploited to improve establishment and sustainable production of poplar on marginal, non-agricultural soils using endophytic bacteria such as Enterobacter sp. 638 as growth promoting agents.
The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a ‘gold standard’ definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines.
Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
Alternative splicing (AS) of genes is an efficient means of generating variation in protein structure and function. AS variation has been observed between tissues, cell types, and different treatments in non-woody plants such as Arabidopsis thaliana (Arabidopsis) and rice. However, little is known about AS patterns in wood-forming tissues and how much AS variation exists within plant populations.
Here we used high-throughput RNA sequencing to analyze the Populus trichocarpa (P. trichocarpa) xylem transcriptome in 20 individuals from different populations across much of its range in western North America. Deep transcriptome sequencing and mapping of reads to the P. trichocarpa reference genome identified a suite of xylem-expressed genes common to all accessions. Our analysis suggests that at least 36% of the xylem-expressed genes in P. trichocarpa are alternatively spliced. Extensive AS was observed in cell-wall biosynthesis-related genes such as glycosyl transferases and C2H2 transcription factors. A total of 27,902 AS events were documented, and most of these events were not conserved across individuals. Differences in isoform-specific read densities indicated that 7% and 13% of AS events showed significant differences between individuals within geographically separated southern and northern populations, a level that is in general agreement with AS variation in human populations.
This genome-wide analysis of alternative splicing reveals high levels of AS in P. trichocarpa and extensive inter-individual AS variation. We provide the most comprehensive analysis of AS in P. trichocarpa to date, which will serve as a valuable resource for the plant community to study transcriptome complexity and AS regulation during wood formation.
The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38 000 000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de).