PomBase (www.pombase.org) is a new model organism database established to provide access to comprehensive, accurate, and up-to-date molecular data and biological information for the fission yeast Schizosaccharomyces pombe to effectively support both exploratory and hypothesis-driven research. PomBase encompasses annotation of genomic sequence and features, comprehensive manual literature curation and genome-wide data sets, and supports sophisticated user-defined queries. The implementation of PomBase integrates a Chado relational database that houses manually curated data with Ensembl software that supports sequence-based annotation and web access. PomBase will provide user-friendly tools to promote curation by experts within the fission yeast community. This will make a key contribution to shaping its content and ensuring its comprehensiveness and long-term relevance.
The fission yeast Schizosaccharomyces pombe is a well-studied eukaryotic model organism that has been used since the 1950s to obtain valuable insights into diverse eukaryotic biological processes including the cell growth and division cycle, genome organization and maintenance, cell morphology and cytokinesis, signaling and stress responses, chromatin and gene regulation and meiotic differentiation (1). Moreover, since the completion of its genome sequence in 2002 (2), fission yeast has emerged as a prime model for the characterization of processes relevant to human disease and cell biology. A large and active community engages in biological and biomedical research using this model system, routinely applying molecular genetic, cell biological and biochemical techniques. Data from small- and large-scale projects are accumulating rapidly and will increase substantially over the next few years; the literature corpus currently exceeds 9000 publications and grows by about 500 publications per year.
PomBase (http://www.pombase.org) has recently been established to provide user-friendly and standardized access to genomic features and annotations, enabling scientists to assimilate novel findings into their research programs, improve experimental design, support the interpretation of genetic screens, and facilitate the interpretation of functional genomics and systems biology experiments. An accurate and comprehensive set of manual annotations of gene products based on published data lies at the centre of this database, and is supplemented by automatic annotation and information about non-genic features. The PomBase project aims to provide three key resources to the fission yeast community:
Historically, functional information about the genome and biology of S. pombe has been maintained in a repository hosted by the GeneDB project at the Wellcome Trust Sanger Institute (WTSI) (http://old.genedb.org/genedb/pombe/). This resource is now superseded by PomBase, which has not only inherited the data from GeneDB, but also includes additional curated data types and high-throughput data sets. The major data types available in PomBase are listed in Table 1.
DNA and protein features are annotated using the Sequence Ontology (SO) (4). Currently, 31 DNA feature terms are used in 25650 annotations (examples include gene, exon, tRNA, centromere) and 22 protein feature terms for 919 annotations (examples include nuclear localization signal, ER retention signal, DDB box).
Gene Ontology (GO) (5,6) terms are assigned to gene products to represent their molecular functions, cellular components (including complexes), and biological processes. PomBase GO annotation data comprises over 33500 manual annotations and approximately 3500 automatically assigned annotations, using 3800 unique GO terms.
To support the comprehensive and detailed representation of phenotypes, we are developing the Fission Yeast Phenotype Ontology (FYPO), a formal ontology of phenotypes observed in fission yeast. FYPO is a modular ontology that uses several existing ontologies from the Open Biological and Biomedical Ontologies (OBO) collection (7) as building blocks, including the phenotypic quality ontology PATO (8), GO, and Chemical Entities of Biological Interest (ChEBI) (9). Over 7000 existing annotations have been converted from the legacy GeneDB controlled vocabulary to FYPO terms; these annotations will support sophisticated querying, computational analysis, and comparison between different experiments and even between different species.
Annotation of genetic and physical interactions is supported using the BioGRID (http://thebiogrid.org) (10,11) annotation format. Existing annotations curated by BioGRID are imported into PomBase, and newly created annotations will be exchanged with BioGRID.
PomBase incorporates a wide variety of data sets that can be mapped to the genome, obtained either from internal sources or via externally loaded URLs or data files. Examples of supported data sets include whole genome re-sequencing data, RNA-seq and ChIP-seq data, various microarray data and other high-throughput data types.
A web portal has been developed for access to the PomBase data, which provides pages describing the current state of annotation of the S. pombe genome, items of interest to the community, and, most importantly, a ‘Gene Overview Page’ that summarizes key information about each gene. Embedded in this portal is a genome browser, providing access to genomic context, sequence-based analyses and high-throughput data.
Gene Overview pages organize gene-specific information, including the gene type, product description, sequence features, phenotypes, Gene Ontology annotation and protein modifications as well as physical and genetic interactions. These pages are central to PomBase (see Figure 1).
A simple search is available on every PomBase page. The Advanced Search allows users to perform queries on multiple feature types including GO annotation, protein domain, characterization status, species distribution, protein length, etc. A query history summarizes queries and allows them to be edited or combined using union or intersection.
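Combining entries in a query history by union or intersection amounts to set operations on the genes each query returned. The following is a minimal sketch of that idea, not the PomBase implementation; the `Query` class, the query labels and the gene identifiers are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A saved query: a human-readable label plus the genes it matched."""
    label: str
    genes: frozenset


def union(a: Query, b: Query) -> Query:
    """Genes matched by either query."""
    return Query(f"({a.label} OR {b.label})", a.genes | b.genes)


def intersection(a: Query, b: Query) -> Query:
    """Genes matched by both queries."""
    return Query(f"({a.label} AND {b.label})", a.genes & b.genes)


# Placeholder results standing in for two earlier searches (not real data).
history = [
    Query("GO term X", frozenset({"gene_a", "gene_b", "gene_c"})),
    Query("protein length > N", frozenset({"gene_b", "gene_c", "gene_d"})),
]

both = intersection(history[0], history[1])
either = union(history[0], history[1])

# Combined queries are appended to the history so they can be reused.
history.extend([both, either])

print(both.label, sorted(both.genes))
print(either.label, sorted(either.genes))
```

Because each combined query is itself a `Query`, it can be fed back into `union` or `intersection` to build more specific selections step by step.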
The PomBase genome browser has been implemented using software developed by the Ensembl project (12). The Ensembl genome browser is a powerful system, offering support for the visualization of sequence, functional annotations, alignments, comparative data and polymorphisms. Many of these features are already exploited by PomBase, and others will be used as and when required for the incorporation of new data types. The use of standard technology readily supports comparative analyses with the genomes of other species that are accessible via Ensembl (also see ‘Implementation’ section, below). Comparative analyses with other fungal genomes and with a range of taxonomically diverse genomes are provided using data generated by the Ensembl Genomes project (13). The Ensembl API enables users interested in comparative analyses to retrieve all data of interest from multiple species present in Ensembl.
The implementation of PomBase harnesses and integrates three complementary, well-supported and mature technologies: Chado (14), Ensembl and Drupal (http://www.drupal.org). Chado provides an environment for manual curation and the management of curated data, while Ensembl provides end-user access via the web portal and display of sequence-based features.
Curated PomBase data are stored in a PostgreSQL database using the Generic Model Organism Database (GMOD)-compliant Chado schema. Chado supports the management and storage of sequence annotation and literature curation using any combination of available ontologies, and is therefore easily extensible to new data types. Sequence features are curated in Chado using Artemis (15,16). New annotations produced by the PomBase curation team and the fission yeast community, and external data from BioGRID and UniProt/GOA (17), are loaded at regular intervals.
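The extensibility described above follows from how Chado links features to ontology terms: an annotation is a row joining a feature to a controlled vocabulary term, so a new ontology adds rows rather than new tables. The sketch below illustrates this with a heavily simplified subset of the Chado tables `cv`, `cvterm`, `feature` and `feature_cvterm`. It runs on an in-memory SQLite database rather than PostgreSQL, omits most columns of the real schema, and uses placeholder gene and term names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for four Chado tables; the real schema has more
# columns (e.g. organism, dbxref and publication references).
conn.executescript("""
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id INTEGER NOT NULL REFERENCES cv(cv_id),
    name TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    cvterm_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")

# Placeholder vocabularies and terms (illustrative names, not real data).
conn.executemany("INSERT INTO cv VALUES (?, ?)", [
    (1, "sequence"),
    (2, "biological_process"),
    (3, "fission_yeast_phenotype"),
])
conn.executemany("INSERT INTO cvterm VALUES (?, ?, ?)", [
    (1, 1, "gene"),
    (2, 2, "example process term"),
    (3, 3, "example phenotype term"),
])
# The feature's own type is also an ontology term (here, "gene").
conn.execute("INSERT INTO feature VALUES (1, 'example_gene_1', 1)")
conn.executemany("INSERT INTO feature_cvterm VALUES (?, ?, ?)", [
    (1, 1, 2),
    (2, 1, 3),
])


def annotations_for(uniquename):
    """Return (ontology, term) pairs annotated to one feature."""
    rows = conn.execute(
        """
        SELECT cv.name, cvterm.name
        FROM feature
        JOIN feature_cvterm ON feature_cvterm.feature_id = feature.feature_id
        JOIN cvterm ON cvterm.cvterm_id = feature_cvterm.cvterm_id
        JOIN cv ON cv.cv_id = cvterm.cv_id
        WHERE feature.uniquename = ?
        ORDER BY cv.name, cvterm.name
        """,
        (uniquename,),
    )
    return rows.fetchall()


print(annotations_for("example_gene_1"))
```

The same query returns annotations from every ontology loaded into `cv`, which is why a new data type backed by an ontology needs no schema change in this model.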
Ensembl is a generic software platform for the automatic annotation, analysis and display of genomes, in use for over a decade. In PomBase, Ensembl provides public access to the integrated fission yeast data, and has been extended to display the deep literature curation managed in Chado. Gene model and annotation data are retrieved from the Chado database using the Bio::Chado::Schema Perl API, and loaded into an Ensembl MySQL database using the Ensembl Perl API. Data in the Ensembl MySQL database are accessible directly or via the Ensembl Perl API, and provide the content served both in the Genome Browser and in the Gene Overview pages. Data from the Ensembl database are also loaded into a BioMart data warehouse (18,19) which supports data mining via web and programmatic interfaces and also provides support for the advanced search (see below).
Drupal is a content management system that has been used to provide the web-based portal for PomBase. The use of Drupal has allowed the creation of a clean and intuitive user interface to access information about S. pombe and the PomBase project, and to support community-based functionality including wikis and discussion forums. To support the PomBase interface, two custom Drupal modules have been developed. The Gene Overview module is responsible for generating the Gene Overview pages by retrieving data about specific genes from a custom web service running on the Ensembl web server, which in turn uses the Ensembl Perl API to query the Ensembl MySQL databases. The Query Builder module supports the advanced search interface, and generates and submits custom queries to the BioMart web service to find genes matching specified criteria.
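To give a sense of what a Query Builder request might look like, the sketch below assembles a query document in the general XML format accepted by BioMart web services (a `Query` element containing a `Dataset` with `Filter` and `Attribute` children). It is an illustration only, not the PomBase module's code: the function name, the dataset name and the filter and attribute names are hypothetical, and the sketch builds the XML without submitting it.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string.

    dataset    -- name of the mart dataset to search (hypothetical here)
    filters    -- mapping of filter name to value; restricts matching genes
    attributes -- list of attribute names; the columns to return
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated results
        "header": "0",        # no header row
        "uniqueRows": "1",    # collapse duplicate rows
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are placeholders, not actual PomBase mart identifiers.
xml_query = build_biomart_query(
    dataset="example_gene_dataset",
    filters={"example_go_filter": "EXAMPLE_GO_ID",
             "example_min_length": 500},
    attributes=["example_gene_id", "example_description"],
)
print(xml_query)
```

In a BioMart deployment, a string like this is typically sent as the `query` parameter of the mart's web service endpoint, and the matching rows come back in the requested format.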
PomBase welcomes contributions from the community to improve the coverage and accuracy of its data. Users wanting to add or modify data in PomBase can directly contact the curation staff (E-mail: helpdesk@pombase.org).
Following a successful pilot project conducted in 2008, a generic web-based curation environment is being developed to support the launch of a comprehensive community curation initiative early in 2012 (Rutherford et al., in preparation). This will allow expert users to directly contribute annotations based on their publications, and will enhance the efforts of core curation staff and contribute to the sustainability of the curation effort in the face of increasing volumes of highly specialized published data.
Additionally, there are many ways that users can directly visualize sequence-based data within the context of the web browser, including access to a Distributed Annotation System, data upload to a private area of the site, and dynamic integration of locally stored BAM files (using standard protocols such as HTTP). These options allow users to directly visualize large-scale experimental results in the context of the reference annotation. Producers of mature data sets of any scale that are ready for full integration in PomBase and public dissemination should contact PomBase at the address above.
All of the tools, protocols and workflows developed by PomBase are publicly available (http://www.pombase.org/downloads) and can be implemented by other research communities to create analogous organism-specific databases either in collaboration with the Ensembl Genomes project or independently. Sequence, features and other annotation are available for bulk download via FTP, while subsets of data can be selectively downloaded using the PomBase BioMart.
PomBase will continue to incorporate large-scale data sets and curate new data types. We will also incorporate sequence data, automatic annotation, and high-throughput data sets available for other species in the Schizosaccharomyces genus (20).
We anticipate that usage of PomBase will extend beyond the S. pombe community to encompass evolutionary biologists studying genome variations and the evolution of yeasts, fungi, and the eukaryota in general; researchers seeking well-studied orthologs of genes of interest in human and other species; curators from other databases; and bioinformaticians and theoretical biologists requiring programmatic access to fission yeast data in order to construct and test novel hypotheses.
Wellcome Trust [WT090548MA to SGO]. Funding for open access charge: Wellcome Trust.
Conflict of interest statement. None declared.
The authors thank members of the Ensembl and Ensembl Genomes teams for contributions to the PomBase data pipelines. We also thank Chris Mungall for helpful discussions on Chado and GO, and we thank members of the S. pombe research community whose feedback has helped establish priorities for PomBase development.