Now in its 10th year, the Gramene database (http://www.gramene.org) has grown from its primary focus on rice, the first fully-sequenced grass genome, to become a resource for major model and crop plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies and a fully-structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST and a public MySQL database. Twice a year, Gramene releases a major build of the database and makes interim releases to correct errors or to make important updates to software and/or data.
Scientific advances in genomics promise to help plant breeders improve quality, pathogen resistance, and yield to meet the growing demands for food, fiber and biofuel. However, the ever-increasing volume of sequence data generated from reference genomes, expression studies and genome-wide genetic diversity studies presents challenges for efficiently storing, curating, analyzing and retrieving such data. Gramene is a free online database for comparative plant genomics that began as an extension of the RiceGenes project (1,2) and now holds many large and varied data sets that are used extensively by thousands of plant researchers in the public and private sectors throughout the US, Asia and Europe. Through the application of standardized annotation methods, Gramene strives to create a resource that promotes cross-species analysis of both conserved and species-specific functions. Various ontologies are used to consistently describe plant anatomy (3), phenotype traits (4), genes (5), environment and taxonomy, and both computational and manual curation are employed to integrate data sets from various leading research projects on plants and public repositories such as GenBank. This article summarizes the changes to the website since the last publication in NAR 2008 (6), through the 31st release of the Gramene website in May 2010.
Plant biologists often enter Gramene through their species of interest, and genome browsers offer a direct window on specific regions and genes. Since Gramene’s inception, we have used the Ensembl genome browser (7). As of an interim release made shortly after our May 2010 release, Gramene uses Ensembl version 58 to visualize eight complete and several more partial plant genomes available from http://www.gramene.org/genome_browser/. Annotations held by Gramene include ab initio, evidence-based and community-generated gene predictions, repeat regions, and homology as well as cross-references to sequences in public databases, locations of quantitative trait loci (QTLs), locations of microarray probes and genome variation such as SNPs and indels. The generation of genome annotations has been described previously (8). Each release of the database contains new and updated annotations. Since our last publication, Gramene has added or updated many plant genomes listed in Table 1.
In addition to the fully sequenced genomes, Gramene has worked with the Oryza Mapping Alignment Project (OMAP) (9) to visualize the physical map of O. rufipogon and the chromosome 3 short arms of O. brachyantha, O. nivara, O. rufipogon, O. barthii, O. glaberrima, O. minuta CC, O. officinalis and O. punctata. We have also now integrated variation data into our genomes such as a set of 71K single nucleotide polymorphisms (SNPs) from grape (10) in order to help researchers to determine the consequence of variation (Figure 1). The Arabidopsis variation database contains data from the screening of over 900 strains using the Affymetrix 250k Arabidopsis SNP chip (http://walnut.usc.edu/2010/data/250k-data-version-3.04) as well as SNP discovery data used to construct the 250K chip from 20 re-sequenced Arabidopsis lines (11).
In 2009, Gramene entered into a formal collaboration with the European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI) and their Ensembl Genomes (EG) project (12) to create a common set of databases and annotations. Gramene has contributed all the ‘core’ databases for the fully sequenced plant genomes available at the EG website (http://plants.ensembl.org), and both groups work on quality control, the integration of content, and the development of new features to share across all available plant genomes, thereby reducing redundancy of effort and standardizing analyses and visualization for the community.
Researchers often use whole genome alignments (WGA) to explore conservation of chromosomal structure and gene structure. Gramene provides pre-computed whole genome and gene–gene alignments using a BLASTZ-net pairwise (13,14) whole genome alignment method implemented by Ensembl to analyze 12 plant genomes (http://www.gramene.org/info/docs/compara/analyses.html#blastz). Ensembl’s release 56 reintroduced multi-species comparative genome views driven by pair-wise alignments that had been absent from the Ensembl views for a year. Figure 2 gives an example showing homology between a 50 Kb region on O. sativa japonica chromosome 9 (central panel) and similar-sized regions of Sorghum bicolor chromosome 2 (top panel) and Brachypodium distachyon chromosome 4 (bottom panel).
Comparative functional genomics allows researchers to trace evolutionary histories of genes and traits, and Gramene's Compara database adds a new level of tools to help researchers make inferences of function and strategies for gene annotation. Gramene uses the standard Ensembl GeneTree method (15) to generate gene trees and predict ortholog and paralog relationships between species. In the current release, the GeneTree database was rebuilt using five monocot genomes (O. sativa japonica, O. sativa indica, O. glaberrima, B. distachyon and S. bicolor), four dicot genomes (A. lyrata, A. thaliana, P. trichocarpa and V. vinifera) and five non-plant model eukaryote genomes (Caenorhabditis elegans, Ciona intestinalis, Drosophila melanogaster, Homo sapiens and Saccharomyces cerevisiae). Figure 3 shows an example of the results of our latest gene tree build.
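The general idea behind predicting orthologs and paralogs from a gene tree can be illustrated with a minimal sketch. It is not the Ensembl GeneTree implementation; it assumes a tree whose internal nodes have already been labelled as speciation or duplication events, and it applies the textbook rule that two genes are orthologs when their last common ancestor is a speciation node and paralogs when it is a duplication node. The tuple-based tree encoding, the function names and the gene names are all hypothetical.

```python
from itertools import product

# Assumed encoding: a tree is either a leaf (a gene name string) or a tuple
# (event, left, right), where event is "speciation" or "duplication".


def leaves(node):
    """Return the names of all genes found under a node."""
    if isinstance(node, str):
        return [node]
    _event, left, right = node
    return leaves(left) + leaves(right)


def classify_homologs(node, relations=None):
    """Label every gene pair by the event at its last common ancestor."""
    if relations is None:
        relations = {}
    if isinstance(node, str):
        return relations
    event, left, right = node
    label = "ortholog" if event == "speciation" else "paralog"
    # Genes on opposite sides of this node have this node as their
    # last common ancestor, so the node's event decides the relationship.
    for gene_a, gene_b in product(leaves(left), leaves(right)):
        relations[frozenset((gene_a, gene_b))] = label
    classify_homologs(left, relations)
    classify_homologs(right, relations)
    return relations


# Hypothetical example: a duplication in rice after the rice/sorghum split.
tree = ("speciation",
        ("duplication", "rice_gene_1", "rice_gene_2"),
        "sorghum_gene_1")
relations = classify_homologs(tree)
for pair, label in relations.items():
    print(sorted(pair), label)
```

In this toy tree the two rice genes are reported as paralogs of each other, and each is reported as an ortholog of the sorghum gene.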
Synteny analysis allows researchers to infer ancestral locations of genes, and the finding of conserved synteny provides a measure of confidence that genes are true orthologs. In previous builds, Gramene used DNA-level whole genome alignments across its many hosted genomes, but, in the current release, Gramene implemented a new synteny analysis pipeline that makes use of gene ortholog assignments from our Compara GeneTree output as an additional parameter to confirm homology. This avoids the complications associated with using WGA, including spurious alignment and differential expansion and contraction within and between genomes. The new method was originally developed for the Maize Project (16) and is now implemented as a ‘runnable’ within our standardized genome annotation methods (17). To start the analysis, strictly collinear orthologs are mapped using DAGchainer (18), giving rise to the classification of high-confidence ‘syntenic:collinear’ gene pairs. Next, these mappings are used as anchor points to identify additional syntenic orthologs that may violate collinearity due to local rearrangements or assembly artifacts. This step is configured using a gene-index distance parameter, and its output defines near-collinear gene pairs classified as ‘syntenic:in-range’. These relationships are stored as gene attributes, and ranges of syntenic blocks are displayed with the Ensembl SyntenyView module. Table 2 shows the three pairs of genomes compared in release 31.
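The second, anchor-based step of this classification can be sketched as follows. This is a simplified illustration rather than the pipeline's actual code: it assumes the collinear anchors have already been produced by DAGchainer, that all pairs lie on a single pair of chromosomes, and that 'in range' means being within a fixed number of gene positions of some anchor in both genomes. The class, the function, the default distance of 10 and the gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrthologPair:
    gene_a: str
    index_a: int  # ordinal position of the gene along its chromosome, genome A
    gene_b: str
    index_b: int  # ordinal position of the gene along its chromosome, genome B


def classify_pairs(pairs, anchors, max_index_distance=10):
    """Label each ortholog pair relative to a set of collinear anchors.

    Anchors are labelled 'syntenic:collinear'. Any other pair lying within
    max_index_distance genes of an anchor in both genomes is labelled
    'syntenic:in-range'. Remaining pairs get None.
    """
    anchor_set = set(anchors)
    labels = {}
    for pair in pairs:
        key = (pair.gene_a, pair.gene_b)
        if pair in anchor_set:
            labels[key] = "syntenic:collinear"
            continue
        near_anchor = any(
            abs(pair.index_a - anchor.index_a) <= max_index_distance
            and abs(pair.index_b - anchor.index_b) <= max_index_distance
            for anchor in anchors
        )
        labels[key] = "syntenic:in-range" if near_anchor else None
    return labels


# Hypothetical example: three collinear anchors plus two extra ortholog pairs.
anchors = [
    OrthologPair("a1", 10, "b1", 100),
    OrthologPair("a2", 11, "b2", 101),
    OrthologPair("a3", 12, "b3", 102),
]
extra = [
    OrthologPair("a4", 13, "b0", 99),    # locally rearranged, close to anchors
    OrthologPair("a50", 200, "b70", 500),  # far from every anchor
]
for key, label in classify_pairs(anchors + extra, anchors).items():
    print(key, label)
```

Working in gene-index space rather than base-pair coordinates keeps the distance threshold insensitive to differences in intergenic distance between the genomes being compared.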
Gramene hosts metabolic pathway databases for eight species including rice, sorghum, Arabidopsis (19), tomato, potato, pepper, Medicago (20), coffee, as well as three reference databases, EcoCyc (21), PlantCyc (22) and MetaCyc (23). These display gene functions in the context of biochemical reactions and networks. Users can download lists of genes associated with each pathway and extract inter-specific comparisons between pathways and associated genes. Gene identifiers link to the gene summary pages of Gramene’s Ensembl genome browser, and we have added an ‘Omics Validator’ tool to map user-provided microarray probe identifiers from various microarray platforms to their respective gene identifiers, starting with rice. The mappings for the arrays are provided from the functional genomics module in the genome browser.
In the current release of the rice pathway database developed by Gramene, our curators added approximately 170 enzymatic and 80 transport reactions, revised approximately 65 tRNA and 600 transport reaction-associated genes, and updated several important rice pathways. Gramene’s RiceCyc has 342 known or predicted metabolic pathways for O. sativa japonica cultivar ‘Nipponbare’ and has undergone several rounds of data-quality enhancement and manual curation. More than 100 literature citations were added or curated. The first release of the Sorghum metabolic pathways (SorghumCyc) developed by Gramene provides 328 pathways. The pathways from rice and sorghum, both developed by Gramene, are provided in a web-based browsable form as well as for bulk download in several options including the BioPax (24) and Systems Biology Markup Language (SBML) (25) formats for advanced users. The annotated pathways are used as external references in the sorghum and rice genome browsers.
Manipulating and storing vast amounts of sequence data from increasingly cheaper and faster sequencing methodologies is a significant challenge. Gramene’s genetic diversity module is specifically designed to facilitate the integration and analysis of these data. It uses the Genomic Diversity and Phenotype Data Model (GDPDM, http://www.maizegenetics.net/gdpdm/) to store RFLP, SSR and SNP allele data, information about QTL, and passport data for wild and cultivated germplasm from rice, maize, wheat, Arabidopsis, and sorghum along with quantitative phenotypic data for some genotype accessions (Table 3).
In 2010, the GDPDM schema was updated to include a data packing system that can easily store and quickly retrieve millions of SNPs. By using binary large objects (BLOBs) in the database, we reduced the space required to store variation data by several orders of magnitude, thereby allowing us to easily query many large data sets. Gramene’s new SNP Query tool (Figure 4) uses this improvement to quickly retrieve and filter SNP data by chromosome and cultivar subgroups. The results provide information about overlapping genomic features and links to visualize them in the Ensembl genome browser. We now provide data sets for visualizing genotype patterns across cultivars of interest using the Scottish Crop Research Institute’s Flapjack program (http://bioinf.scri.ac.uk/flapjack/). A Java Web Start-enabled version of the Tassel (26) program is provided for evaluating trait associations, patterns of linkage disequilibrium and genetic diversity. In the last year, we have added many features to Tassel including a new alignment viewer, progress monitoring, pipelines and wizards for automatic data loading and analysis. For users who prefer to interact with data using their own tools, all diversity data is provided in various download formats including HapMap and PLINK at http://www.gramene.org/diversity/download_data.html.
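The space saving from packing can be illustrated with a small sketch. The actual GDPDM packing scheme is not described here, so the encoding below is purely an assumption: each allele call is restricted to A, C, G or T and stored in two bits, so that the calls for one marker across many accessions fit into a single byte string suitable for a BLOB column instead of one row per call. Function names are hypothetical, and missing or heterozygous calls are deliberately ignored.

```python
# Assumed 2-bit encoding of the four nucleotides.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_calls(calls: str) -> bytes:
    """Pack a string of A/C/G/T calls into bytes, four calls per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, base in enumerate(calls):
        # Call i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_calls(blob: bytes, n_calls: int) -> str:
    """Recover the first n_calls calls from a packed byte string."""
    return "".join(
        BASES[(blob[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_calls)
    )


# Hypothetical example: one SNP scored in eight accessions.
calls = "ACGTTGCA"
blob = pack_calls(calls)
print(len(calls), "calls stored in", len(blob), "bytes")
print(unpack_calls(blob, len(calls)))
```

The number of calls has to be stored alongside the BLOB, because the final byte may be only partly filled.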
A new entry point for plant breeders and geneticists was added by way of the ‘germplasm’ unit (http://www.gramene.org/db/germplasm/) to summarize all the curated data we hold for the most popular cultivars and wild accessions of rice. Access to this database is by species or genotype/germplasm accession instead of genomic coordinates or markers. From the germplasm home page, users can search for markers or genetic diversity information related to a particular accession.
In addition to the many custom data sets we curate in collaboration with researchers in the plant community, Gramene mirrors GenBank’s Viridiplantae sequences for our genome alignment pipeline. Gramene’s markers and sequences database now holds around 49 million records we judge to be the most valuable to our users. This database also stores the alignment results from the annotation of our completed genomes as well as manually curated maps provided by the researchers/projects and those extracted from peer-reviewed publications. As this database is also the source for Gramene’s comparative maps and DAS, it is a central organizing point for users to see how markers and sequences are related to each other as well as to QTLs, source germplasm and various ontologies.
Gramene’s comparative maps database now holds almost 8 million features on 214 map sets from genetic, physical, bin, sequence, cytogenetic and QTL studies. Gramene uses the CMap application (27) to allow users to create cross-species comparisons of any map type. Since our last publication, we have curated from the literature an additional 17 maps from rice, sorghum, barley, maize, wheat and Aegilops tauschii (28) as shown in Figure 5. Links from CMap’s feature details page allow the user to return to the source markers and sequences database to explore associations to other data sets in Gramene such as ontologies and genes.
Gramene’s QTL database (29) has seen no change to the number of QTL since our last update, holding steady at 11 624 curated QTL from 10 species. The QTL are associated with terms from the trait ontology (TO), plant ontology (PO), growth ontology (GRO) and environment ontology (EO), as well as with co-localized or neighboring markers and Gramene gene identifiers. A recent improvement is that users may now search for QTL by any of these associations. By following links to the various ontology term definitions, users may see genes, proteins, markers and other QTL also related to the term. The locations of rice QTL on the O. sativa japonica genome are inferred through the alignments of their associated markers. Links from the QTL details pages allow the user to view QTL on the experimental map in CMap or in the Ensembl browser where the ‘Export data’ button allows users to easily extract all the features (genes, repeats, SNPs, etc.) located in the QTL’s region.
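One simple way to picture how a QTL location can be derived from marker alignments, and how the features in that region can then be extracted, is sketched below. The rule used here is an assumption made for illustration, not a description of Gramene's method: the QTL interval is taken as the span covered by its associated markers when they all align to the same chromosome. All names and coordinates are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    chromosome: str
    start: int
    end: int


def infer_qtl_interval(qtl_name, marker_hits):
    """Return the span covered by the marker alignments of one QTL.

    Returns None when there are no hits or when the markers align to more
    than one chromosome, since no single interval can be inferred then.
    """
    chromosomes = {hit.chromosome for hit in marker_hits}
    if len(chromosomes) != 1:
        return None
    return Interval(
        qtl_name,
        chromosomes.pop(),
        min(hit.start for hit in marker_hits),
        max(hit.end for hit in marker_hits),
    )


def features_in_interval(region, features):
    """Return the features that overlap the region on the same chromosome."""
    return [
        f for f in features
        if f.chromosome == region.chromosome
        and f.start <= region.end
        and f.end >= region.start
    ]


# Hypothetical example: two flanking markers and three annotated genes.
markers = [Interval("marker_1", "3", 1_000, 1_200),
           Interval("marker_2", "3", 9_000, 9_150)]
genes = [Interval("gene_x", "3", 500, 1_100),
         Interval("gene_y", "3", 4_000, 5_000),
         Interval("gene_z", "5", 4_000, 5_000)]
qtl = infer_qtl_interval("example_qtl", markers)
print(qtl)
print([g.name for g in features_in_interval(qtl, genes)])
```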
Since our last update, we have continued to work on making our user interfaces cleaner and more informative. Our footer bar was redesigned to be smaller and less obtrusive, and the front page was redesigned to highlight Gramene’s major data sets (e.g. genes, proteins, QTL) as entry points for users (Figure 6). Also prominently featured on the front page as well as in the upper-right corner of every page is the ‘quick search’, which has itself been improved with the ability to filter results by species where applicable. For bioinformaticians and software developers interested in installing a local copy of the Gramene database, we upgraded the internal web server to the most recent Apache version 2. Gramene also hosts several BioMart databases to allow users to easily execute complex queries of various data sets we hold, the results of which can be viewed in the web browser, downloaded, or integrated into the Galaxy system (30).
Sometimes the advanced user needs access to Gramene’s data through means other than our web pages, so we provide several ways to connect directly, such as our public, read-only MySQL server. The host ‘gramenedb.gramene.org’ mirrors the current build of our databases and can be accessed using the password ‘gramene’. With over 300 tracks to choose from, Gramene’s DAS can be used with our Ensembl browsers or any other DAS client to access our annotations. Recently we improved the query engine by moving from MySQL to FastBit (31), a bitmap indexing system that executes queries in a fraction of the time required by MySQL. The GDPC API also allows direct interaction with our diversity databases. Finally, Gramene continues to maintain BLAST databases for our users.
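The principle behind bitmap indexing can be shown with a toy example. This sketch does not use FastBit or reproduce its API; it only illustrates the general technique, in which each distinct value of a column gets a bit vector with one bit per row, and a multi-column query is answered by combining those vectors with bitwise AND rather than scanning rows. The class name, column names and example rows are hypothetical, and Python integers stand in for the bit vectors.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy equality-only bitmap index over a list of dict rows."""

    def __init__(self, rows, columns):
        self.n_rows = len(rows)
        # bitmaps[column][value] is an int whose bit i is set when
        # row i holds that value in that column.
        self.bitmaps = {column: defaultdict(int) for column in columns}
        for row_id, row in enumerate(rows):
            for column in columns:
                self.bitmaps[column][row[column]] |= 1 << row_id

    def match(self, **conditions):
        """Return the ids of rows satisfying every column=value condition."""
        result = (1 << self.n_rows) - 1  # start with all rows selected
        for column, value in conditions.items():
            result &= self.bitmaps[column].get(value, 0)
        return [i for i in range(self.n_rows) if (result >> i) & 1]


# Hypothetical annotation rows.
rows = [
    {"chromosome": "3", "feature_type": "gene"},
    {"chromosome": "3", "feature_type": "repeat"},
    {"chromosome": "3", "feature_type": "gene"},
    {"chromosome": "9", "feature_type": "gene"},
]
index = BitmapIndex(rows, ["chromosome", "feature_type"])
print(index.match(chromosome="3", feature_type="gene"))
```

Such indexes suit read-mostly data like annotation tracks, where the bit vectors are built once per release and then queried many times.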
In an effort to encourage community curation, Gramene created the PlantGeneWiki (http://plantgenewiki.gramene.org/) to allow users to search genes as well as to register, contribute new gene entries and edit existing ones for plant species. Designed as an online community portal on plant genes and their annotations, the site is managed by the research community and Gramene staff.
Gramene makes all databases and software freely available under the GNU General Public License. Downloads are available from the Gramene FTP site (ftp://ftp.gramene.org). In addition, Gramene allows anonymous, read-only access to the Subversion source code repository at http://svn.warelab.org/gramene/trunk. In this way, users can have access to any previous release as well as the most current changes in our development code.
The Gramene staff uses many methods to inform, educate and interact with our users. A public news blog (http://news.gramene.org) with RSS feed capabilities is maintained to keep our users informed of changes to the website as well as important publications, job opportunities and meetings of interest to our researchers. In addition to our on-going relationship with OpenHelix (http://www.openhelix.com) (32) to provide tutorials, in the last year members of the Gramene team have been creating very short video tutorials that introduce very specific topics on Gramene or new tools and data sets (http://www.gramene.org/tutorials). Our staff also presents posters, talks and hands-on workshops at meetings such as the annual Plant and Animal Genome (PAG) conference, the Rice Technical Working Group, the Maize Genetics Meeting, Intelligent Systems for Molecular Biology (ISMB), Plant Biology and Genome Informatics.
National Science Foundation (0703908, 0851652). Funding for open access charge: National Science Foundation (0321685); NSF DBI (0703908).
Conflict of interest statement. None declared.
We would like to thank our users for their feedback and support as well as our collaborators and contributors who have supplied Gramene with data, especially NSF projects #0638566 (High Density Scoreable Markers for Maize Trait Dissection), #0321538 (An Annotation Resource for the Rice Genome), #0606461 (Exploring the Genetic Basis of Transgressive Variation in Rice), #0723510 (Collaborative Research: An Arabidopsis Polymorphism Database), #0638820 (OMAP), #0701916 (Physical Mapping of the Wheat D Genome), #0743804 (POPcorn), #0543441 (NextGen PLEXdb), #0638820 (The evolutionary genomics of invasive weedy rice) and the USDA-ARS CRIS 9235-21000-013-00D (Complete Switchgrass Genetic Maps Reveal Subgenome Collinearity, Preferential Pairing and Multilocus). Gramene is deeply indebted to our Science Advisory Board members Paul Flicek, Michael Ashburner, Anna McClung, Georgia Davis, David Marshall, Patricia Klein, William Beavis and Tim Nelson for their critical comments, suggestions and improvements. We also thank Peter Van Buren for his excellent system administration work.