The MIPS Fusarium graminearum Genome Database (FGDB) was established as a comprehensive genome database on one of the most devastating fungal plant pathogens of wheat, barley and maize. The current version, FGDB v3.1, provides information on the full, manually revised gene set based on the Broad Institute assembly FG3 genome sequence. The results of gene prediction tools were integrated with comparative data on related species, yielding a set of 13718 annotated protein-coding genes. This rigorous approach involved adding or modifying gene models and represents a coding sequence gold standard for the genus Fusarium. The gene loci improvements resulted in 2461 genes that are either new or have different structures compared to the Broad Institute assembly 3 gene set. Moreover, the database serves as a convenient entry point to explore expression data results and to obtain information on the Affymetrix GeneChip probe sets. The resource is accessible at http://mips.gsf.de/genre/proj/FGDB/.
The ascomycete Fusarium graminearum (teleomorph Gibberella zeae) is the causal agent of several plant diseases of world-wide economic importance (1). Fusarium head blight of cereals and Fusarium ear rot of maize lead to severe yield losses and quality problems. Most importantly, mycotoxins (2) produced by the pathogen contaminate infected plant material and derived food and feed products, leading to a health risk. To protect consumers and to avoid a negative impact on farm animals, maximum tolerated levels for Fusarium toxins have been enacted in many countries and costly mycotoxin monitoring programs have been implemented. The most sustainable solution to the problem seems to be breeding resistant plants. Yet this is difficult, because the molecular basis of quantitative resistance differences is not understood (3). The pathogen has a very broad host range and seems to be able to suppress plant defense responses in ways that are currently not understood, or understood only to a very limited extent (4). The elucidation of fungal virulence mechanisms and the identification of virulence genes that can be targeted by breeding or biotechnological approaches are the main goals of a large research community. As a first step in the development of genomics tools for F. graminearum and as a basis for functional genomics approaches, the full genome sequence of one F. graminearum strain was determined (5).
The setup of the first version of the FGDB (6) was supported by a project funded by the Austrian genome initiative GEN-AU and was based on the first genome assembly. It already focused on manual improvements of gene calls. The intuitive user interface allowed access to the data through various search and browsing methods. Input from the research community enhanced the annotation effort and established the resource as a key tool for F. graminearum genomics (5).
The current FGDB v3.1 (http://mips.gsf.de/genre/proj/FGDB/) aims to provide a comprehensive resource for the international research community based on the latest assembly of the genome sequence and on a manually revised set of 13718 genes, 319 tRNAs and genetic markers with a detailed functional annotation and bioinformatic analysis. In addition, the database was expanded to provide convenient access to available GeneChip expression data.
The source data for FGDB were provided by the F. graminearum sequencing project at the Broad Institute, which is supported by the National Research Initiative, part of the US Department of Agriculture’s (USDA’s) Cooperative State Research, Education and Extension Service. The current content of FGDB v3.1 is based on the Broad assembly 3, which comprises 31 supercontigs (7). The Broad Institute used the previous FGDB version 1 with its manually revised gene calls to improve their current gene set. Based on this set, all gene loci in FGDB v3.1 were re-annotated using a pipeline including (i) Fgenesh with different matrices (www.softberry.com); (ii) GeneMark-ES (8); (iii) Augustus with ESTs, previously annotated Fusarium models and/or Neurospora crassa protein sequences as training data or as hints for the predicted model structure (9); and (iv) EST data as well as Blastx data of related Fusarium species (F. verticillioides, F. oxysporum and F. solani). The different models were displayed in GBrowse (10), allowing comprehensive manual validation of the coding sequences (CDSs). The best fitting model per locus was selected manually and, where changes were required, the respective gene calls were manually corrected using Apollo (11). The gene identifiers have been retained unchanged from the Broad FG3 gene set if the model was identical. All altered (1770) or newly added (691) gene calls are named FGSG_15xxx and above. The outdated draft identifiers used for the Affymetrix GeneChip design (fgdxx-xxx, 13938 genes) (12) and the corresponding FG1 identifiers (fgxxxxx, 11640 genes) are listed as aliases in the entry pages and are linked to Pedant databases for details (13).
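The identifier convention described above can be checked programmatically. The following minimal Python sketch assumes that identifiers consist of the prefix FGSG_ followed by exactly five digits, which is inferred from the "FGSG_15xxx" pattern and is not stated explicitly; the function name is hypothetical.

```python
import re

# Assumed identifier shape: "FGSG_" followed by exactly five digits.
_FGSG_ID = re.compile(r"^FGSG_(\d{5})$")

# Altered or newly added gene calls are named FGSG_15xxx and above.
REVISED_ID_START = 15000


def is_revised_gene_call(gene_id: str) -> bool:
    """Return True if the ID lies in the range used for altered or new models."""
    match = _FGSG_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not an FGSG identifier: {gene_id!r}")
    return int(match.group(1)) >= REVISED_ID_START


if __name__ == "__main__":
    # Identifiers taken from the text of this article.
    for gene_id in ("FGSG_17357", "FGSG_12369", "FGSG_16372"):
        print(gene_id, is_revised_gene_call(gene_id))
```

Such a check only separates the two identifier ranges; it cannot distinguish altered from newly added gene calls, since both share the same range.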
The ORF data and resulting protein sequences are imported into the Pedant system for a detailed functional and structural bioinformatic analysis. The core results are re-imported into FGDB for convenient display and indexing. The Pedant analysis details are interlinked with each FGDB entry. The assembly 1 data were used for the design of an Affymetrix GeneChip (12). The single probes were mapped onto the supercontigs using Blat at 100% identity. Probe sets corresponding to gene loci are searchable and are visualized in the GBrowse viewer. The initial expression analysis results are integrated to give a brief overview of the expression of single genes. Similarity-based data (e.g. homology between protein pairs) are retrieved from and interlinked to the Similarity Matrix of Proteins (SIMAP), which is updated monthly (14).
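As an illustration of the probe mapping step, the following Python sketch filters Blat output in PSL format for perfect hits. It is not the pipeline used for FGDB: it assumes that "100% identity" means a gap-free alignment over the full probe length without mismatches, and all names in it are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class ProbeHit(NamedTuple):
    probe: str
    contig: str
    strand: str
    start: int  # start on the contig, 0-based as in PSL
    end: int    # end on the contig, exclusive as in PSL


def perfect_hits(psl_lines: Iterable[str]) -> Iterator[ProbeHit]:
    """Yield hits where the whole probe aligns without mismatches or gaps."""
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # PSL records have 21 columns and start with an integer;
        # anything else (e.g. header lines) is skipped.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, mismatches, rep_matches = (int(f) for f in fields[0:3])
        query_gaps, target_gaps = int(fields[4]), int(fields[6])
        query_size = int(fields[10])
        if (mismatches == 0 and query_gaps == 0 and target_gaps == 0
                and matches + rep_matches == query_size):
            yield ProbeHit(fields[9], fields[13], fields[8],
                           int(fields[15]), int(fields[16]))
```

Probes with several perfect hits would be reported once per location; how such cases were handled for FGDB is not described here.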
Comparison of the Broad FG3 and FGDB v3.1 annotated gene sets indicates that 11257 genes (82%) are exactly the same in terms of exon/intron structure. A total of 2056 genes in the Broad set either have a different structure or are absent from FGDB. A total of 2461 genes in FGDB either have a different structure or are absent from the Broad data. Based on evidence of protein similarity to related species, 26 genes in the Broad set have been split into two or more genes in FGDB, while 147 genes in FGDB were merged from two or more genes of the Broad set. Overall, FGDB v3.1 contains 383 more introns than the Broad set, with a decrease in mean intron length from 83.4 to 76.6 nt. Both annotation sets have ~65% of genes annotated with at least one putative InterPro domain (15). The average number of domains annotated per gene for both Broad and FGDB is ~1.7. As judged by confirmation of introns by available ESTs, both Broad and FGDB are of similar quality, indicating that the validation of gene calls by available EST data was similarly efficient for both pipelines.
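A comparison by exact exon/intron structure can be sketched in Python as follows. This is an illustration rather than the procedure used for the figures above: it assumes that each gene model is reduced to its contig, strand and exon coordinates, and the function names and toy coordinates are invented.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]                         # (start, end) on the contig
Structure = Tuple[str, str, Tuple[Exon, ...]]  # (contig, strand, exons)


def structure_key(contig: str, strand: str, exons: List[Exon]) -> Structure:
    """Build a hashable key; two models are identical only if all parts match."""
    return (contig, strand, tuple(sorted(exons)))


def compare_gene_sets(first: Dict[str, Structure],
                      second: Dict[str, Structure]) -> Dict[str, int]:
    """Count models of `first` found in `second` and models unique to each set."""
    keys_first, keys_second = set(first.values()), set(second.values())
    return {
        "identical": sum(1 for s in first.values() if s in keys_second),
        "only_in_first": sum(1 for s in first.values() if s not in keys_second),
        "only_in_second": sum(1 for s in second.values() if s not in keys_first),
    }


if __name__ == "__main__":
    # Invented toy models, not real FGDB or Broad coordinates.
    broad = {
        "g1": structure_key("c1", "+", [(100, 200), (260, 400)]),
        "g2": structure_key("c1", "-", [(500, 900)]),
    }
    fgdb = {
        "g1": structure_key("c1", "+", [(100, 200), (260, 400)]),
        "g2": structure_key("c1", "-", [(500, 700), (760, 900)]),
        "g3": structure_key("c2", "+", [(10, 90)]),
    }
    print(compare_gene_sets(fgdb, broad))
```

In this scheme a model counts as unique whether its structure differs or the gene is absent from the other set, which mirrors the "different structure or absent" wording above. The reported 82% corresponds to 11257 identical models out of the 13718 FGDB genes.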
There are 103, 55 and 1651 proteins predicted only in FGDB, only in Broad and in both annotation sets as part of the secretory pathway [TargetP, RC < 4 (16)], respectively. In particular, both Broad and FGDB models now enable secretion prediction of FGSG_17357 (related to inorganic pyrophosphatase IPP1) and FGSG_12369 (related to catalase 2) as identified previously in an extracellular proteomics study (17) on models without SignalP signals (18). In addition, FGDB predictions help confirm the secretory pathway membership of hypothetical protein FGSG_16372 as identified in that study.
The database interface provides basic search options on the sidebar, which allow full text search across gene codes, gene symbols and gene descriptions. In addition, the annotation catalogs FunCat (19), Enzyme Class (20), InterPro (15) and Protein Class are browsable. The advanced search page also offers access to invalid gene models that disagree with known evidence, details on the GeneChip data such as probe and probe set names and their locations (12), tables on tRNAs and a customizable table on protein molecular weights and isoelectric points. The ORF/contig DNA and protein sequences are searchable by Blast.
The single entry page of a gene locus lists information on outdated gene models, alias names and protein classification (six classes from known to hypothetical). Besides physical features such as contig coordinates and molecular weight, the hierarchical functional classification FunCat (19) and EC-number classes (20) as well as InterPro IDs (15) and TargetP (16) results are provided. SIMAP-based protein homology data can be retrieved using links grouped by NCBI-based taxonomic categories.
The Pedant links shown in the individual gene records forward to the respective Pedant report pages including alternative views on the DNA level as well as a graphic protein feature view. A small contig pictogram on the right side of each individual gene report page is linked to a GBrowse view allowing graphical browsing of genes, GeneChip probes, EST data and outdated gene models on their corresponding contigs.
To give a brief overview of the initial expression analysis data (12,21–23) for single genes, the ‘Expression Data’ link below the contig pictogram provides a short description of the experiments and presents the expression data for all matching probe sets. In addition, a more comprehensive overview of the most recent expression data is provided by a link to the ‘PLEXdb GeneOscilloScope’ (24). The advanced query option (Index Search) on the left panel can be used to retrieve a list of current FGDB entries based on complex queries that combine, for example, InterPro domains, TargetP results and probe set names (e.g. “fgd122-100_at”[pgs]|“fgd122-620_at”[pgs]). For this purpose, the major database fields are indexed, which allows a fast and combined ‘index search’ (http://mips.gsf.de/genre/proj/FGDB/Search/Gise/).
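The example query suggests a simple syntax: a quoted term followed by a field tag in square brackets, with several terms joined by "|". The Python sketch below only assembles such a query string and does not contact the server. It assumes that "|" combines terms as alternatives and that straight double quotes are accepted; only the field tag pgs is taken from the example above, and the function name is hypothetical.

```python
from typing import Iterable


def build_index_query(terms: Iterable[str], field: str = "pgs") -> str:
    """Join quoted terms, each tagged with the same field, using '|'."""
    parts = [f'"{term}"[{field}]' for term in terms]
    if not parts:
        raise ValueError("at least one search term is required")
    return "|".join(parts)


if __name__ == "__main__":
    # Probe set names taken from the example query in the text.
    print(build_index_query(["fgd122-100_at", "fgd122-620_at"]))
```

Other field tags, for instance for InterPro domains or TargetP results, are not given in the text and would have to be looked up on the Index Search page.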
The data can be downloaded from ftp://ftpmips.gsf.de/FGDB/. Besides the protein, contig and chromosome sequence files in FASTA format, the ORF data are provided in GFF3 format. Functional data such as FunCat, TargetP and InterPro annotations are accessible in tab-delimited files.
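Downloaded GFF3 data can be used, for example, to derive intron lengths from the coding segments of each gene model. The following Python sketch assumes that CDS features carry a Parent attribute, as is common in GFF3; the actual layout of the FGDB files is not described here, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based and inclusive as in GFF3


def cds_by_parent(gff3_lines: Iterable[str]) -> Dict[str, List[Segment]]:
    """Collect the CDS segments of each parent feature, sorted by start."""
    segments: Dict[str, List[Segment]] = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        parent = attrs.get("Parent")
        if parent is not None:
            segments[parent].append((int(cols[3]), int(cols[4])))
    return {parent: sorted(segs) for parent, segs in segments.items()}


def intron_lengths(segments: List[Segment]) -> List[int]:
    """Return the gap lengths between consecutive CDS segments.

    With inclusive coordinates, a segment ending at e followed by one
    starting at s leaves s - e - 1 bases in between. Introns located
    in untranslated regions are not captured by CDS features.
    """
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(segments, segments[1:])]
```

Averaging the values returned by `intron_lengths` over all gene models gives a mean intron length comparable in kind to the figures quoted for the two annotation sets, although the exact method behind those figures is not specified.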
FGDB v3.1 is a comprehensive resource on the fungal plant pathogen Fusarium graminearum and provides user-friendly access to gene structure and functional data. Protein homology-based data from public genomes are routinely updated. Although the ORFeome has been completely revised in this version, updates to single gene structures are likely as new sequence data from further F. graminearum strains and closely related species, or new EST data, become available. We encourage any input of additional evidence to further improve the gene set and the overall annotation of the genome. Submitted links to gene-specific publications, contact information on existing mutant strains and other details will also be included.
Austrian Science Fund FWF (special research project Fusarium, F3702 and F3705). Funding for open access charge: Helmholtz Zentrum München, German Research Center for Environmental Health, Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany.
Conflict of interest statement. None declared.