ParameciumDB is a community model organism database built with the GMOD toolkit to integrate the genome and biology of the ciliate Paramecium tetraurelia. Over the last four years, post-genomic data from proteome and transcriptome studies has been incorporated along with predicted orthologs in 33 species, annotations from the community and publications from the scientific literature. Available tools include BioMart for complex queries, GBrowse2 for genome browsing, the Apollo genome editor for expert curation of gene models, a Blast server, a motif finder, and a wiki for protocols, nomenclature guidelines and other documentation. In-house tools have been developed for ontology browsing and evaluation of off-target RNAi matches. Now ready for next-generation deep sequencing data and the genomes of other Paramecium species, this open-access resource is available at http://paramecium.cgm.cnrs-gif.fr.
More than half a century ago, the pioneering work of Tracy Sonneborn established Paramecium as a unicellular model of great interest for studies of cellular organization and non-Mendelian heredity. Paramecium and other ciliates are the only unicellular eukaryotes that separate germ line and somatic functions. Germ line micronuclei undergo meiosis and transmit the genetic information to the next sexual generation. A somatic macronucleus contains a rearranged version of the genome streamlined for gene expression and is the seat of all transcriptional activity during vegetative growth.
The somatic genome of the Paramecium species most used for genetic studies, Paramecium tetraurelia, was sequenced and annotated (1), revealing nearly 40,000 protein-coding genes. This surprisingly large number of genes, especially for a unicellular organism, turned out to be the result of a series of at least three whole genome duplications (WGD). These rare and dramatic events, previously documented in plants, animals and fungi, are resolved over evolutionary time by the loss of one duplicate for the majority of genes (2). In P. tetraurelia, the most recent WGD (51% of pre-duplication genes still in two copies) occurred just before the explosion of speciation events that gave rise to the Paramecium aurelia complex of 15 sibling species (1,3), while the old WGD (8% of genes still in two copies) probably occurred before the divergence of Paramecium and Tetrahymena (1,4).
Paramecium remains an outstanding model for studies of cilia and ciliary basal bodies (5,6) and of genome rearrangements and their epigenetic control (7,8). The remarkable ease and efficiency of RNAi by feeding in this organism provides a powerful tool for analysis of gene function (9). In addition, Paramecium now constitutes a unique system to study the mechanisms of genome evolution consequent to gene duplication, since a very low rate of large-scale genome rearrangements in the Paramecium lineage (10) made it possible to identify an unprecedented large number of WGD paralogs of different ages and because the P. aurelia species complex provides the opportunity of testing the role of reciprocal gene loss in speciation.
ParameciumDB was inaugurated at the same time as the publication of the somatic genome sequence (1,11). The mission of this open-access resource is to integrate genomic and post-genomic data, improve genome annotations and provide information about the biology of the reference species P. tetraurelia. The database strives to be standards-aware, provide downloadable data sets and database dumps and assure curation involving the research community. Since its inauguration, ParameciumDB has acquired new computational and post-genomic data, new tools for viewing, searching and annotating the data and is now ready to accommodate next-generation sequencing (NGS) and the genomes of other Paramecium species.
ParameciumDB uses the modular, standards-aware Chado relational database schema (12). All data is typed using controlled vocabularies such as the Sequence Ontology (SO) (13) and microarray data is stored using the MIAME-compliant MAGE module. Phenotypes associated with mutant strains and alleles are represented by the entity–quality (EQ) model and the PATO Quality Ontology (14).
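As a rough illustration of the entity–quality idea, the sketch below pairs one ontology term for the affected entity with one term for the quality it exhibits. It is not the Chado phenotype module itself: the class names, the allele name and the term identifiers are hypothetical placeholders, not real ontology accessions or ParameciumDB records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyTerm:
    """A controlled-vocabulary term: an accession plus a readable label."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQPhenotype:
    """An entity-quality statement attached to an allele."""
    allele: str
    entity: OntologyTerm   # what is affected
    quality: OntologyTerm  # how it is affected (a PATO-style quality)

    def describe(self) -> str:
        return f"{self.allele}: {self.entity.label} is {self.quality.label}"


# Placeholder values only; these are NOT real accessions or alleles.
phenotype = EQPhenotype(
    allele="example-allele-1",
    entity=OntologyTerm("ENTITY:0000001", "example structure"),
    quality=OntologyTerm("QUALITY:0000001", "example quality"),
)
print(phenotype.describe())
```

Keeping the entity and the quality as separate typed terms is what allows phenotypes to be queried by either component.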
Since the initial release of the database, new post-genomic data sets have been introduced. Two sets of proteome peptides are currently available in ParameciumDB. The first set is from the infraciliary lattice (ICL), a continuous contractile network that constitutes the innermost layer of the Paramecium cortical cytoskeleton. The ICL is composed of small, EF-hand Ca2+-binding proteins known as centrins and large Sfi-related centrin-binding proteins that transduce the Ca2+ sensed by the centrins into macroscopic cellular contraction (15,16). The second set of proteome peptides is from isolated cilia (6). The gene expression data currently available in ParameciumDB is from genome-wide transcriptome experiments using a NimbleGen custom microarray platform (17). The experiments identified genes differentially expressed during the sexual cycle of autogamy, during reciliation and during recovery from massive exocytosis that stimulates secretory granule biogenesis. A genome-wide study of the Paramecium kinome (18) has likewise been incorporated, as has annotation of non-coding RNA (19). Figure 1 shows how post-genomic data is displayed in the genome browser.
Publications concerning Paramecium are now mined from PubMed once a month and curated to establish links to genes and to add the published gene nomenclature to ParameciumDB. Orthologs in 33 species predicted using Inparanoid (20) are shown on every gene page and the detailed view provides protein alignments and links to external databases.
Supplementary Table S1 shows the currently available data, the ontology used to type the data and the study that generated the data, if relevant.
The GMODWeb framework used to render ParameciumDB gene and protein pages has been described (11,21). Since the initial release of the database, a BioMart data warehouse (22) has been populated with the data in ParameciumDB to allow complex queries. It is possible to download customized reports of the query results as tabulated text files, Excel files with active links to ParameciumDB gene pages, or dumps in standard SQL or GFF3 formats. The sequences can be downloaded in FASTA format. Since the Paramecium BioMart is available through the EBI portal, Paramecium data sets can be accessed by Galaxy (23), a web-based data analysis platform.
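For readers who want to process a GFF3 dump downstream, the following minimal sketch parses one feature line into its nine standard tab-separated columns and decodes the attribute column. The function name and the example line (scaffold name, source, gene identifier) are hypothetical and only illustrate the format; they are not taken from ParameciumDB.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (comments/directives excluded)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])  # coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)  # undo percent-encoding
    record["attributes"] = attributes
    return record


# Hypothetical example line, not real ParameciumDB data.
example = "scaffold_1\texample\tgene\t100\t950\t.\t+\t.\tID=gene0001;Name=hypothetical%20gene"
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"]["Name"])
```

Lines beginning with `#` are comments or directives in GFF3 and should be skipped before calling the parser.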
ParameciumDB now uses the latest release (version 2.03) of GBrowse (24). In addition to providing a much improved user experience, GBrowse2 makes it possible to visualize next-generation deep sequencing data. The Bio::DB::Sam Perl module port of Samtools (25) allows the browser to render binary files with mapped short reads and even zoom down to the sequences. Although no NGS data is currently available (July 2010) through the public version of ParameciumDB, we have used the same software to make a tool that evaluates RNAi off-target matches. The tool is intended to help researchers design RNAi reagents that will target only the gene(s) of interest. It cuts the sequence provided by the user into overlapping windows of a user-specified size (23 nt, the size of Paramecium siRNA, is the default value). The short sequences are then mapped to the genome, with a choice of 0–3 mismatches, using the BWA aligner (26). A binary file is created on the fly with Samtools and displayed by GBrowse2. This tool is also useful for identification of paralogs/repetitions of chosen coding or non-coding regions of the genome. A track in the genome browser, ‘RNAi off-target’, was pre-calculated using the same approach but gives the inverse image, as it shows how many 23 nt stretches from elsewhere in the genome map to a given genomic position (Figure 1).
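The windowing and mapping steps of the off-target tool can be sketched as follows. This is only an illustration of the principle, not the ParameciumDB implementation: a naive Hamming-distance scan of the forward strand stands in for the BWA aligner, the Samtools and GBrowse2 steps are omitted, and all function names and the toy sequences are hypothetical.

```python
def sliding_windows(seq, size=23, step=1):
    """Yield (offset, window) for each overlapping window of length `size`."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_hits(window, genome, max_mismatches=0):
    """Genome start positions (0-based, forward strand only) where `window`
    matches with at most `max_mismatches` substitutions.
    A stand-in for a real short-read aligner such as BWA."""
    n = len(window)
    return [
        i for i in range(len(genome) - n + 1)
        if hamming(window, genome[i:i + n]) <= max_mismatches
    ]


def off_target_hits(query, genome, size=23, max_mismatches=0):
    """Map every window of `query` to the genome; returns {offset: [positions]}."""
    return {
        offset: naive_hits(window, genome, max_mismatches)
        for offset, window in sliding_windows(query, size)
    }


# Toy sequences with a 4 nt window so that the result is easy to inspect.
toy_genome = "ACGTACGTTTACGA"
toy_query = "ACGTA"
print(off_target_hits(toy_query, toy_genome, size=4))
print(naive_hits("ACGT", toy_genome, max_mismatches=1))
```

A window that maps to more than one genomic position, or to a position outside the intended target gene, flags a potential off-target match. A real aligner would also search the reverse strand, and may report indels, which this sketch ignores.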
Other new tools include a motif finder and an early version of an ontology browser that links Gene Ontology (GO) terms to genes. Table 1 provides an exhaustive list of the tools available in ParameciumDB.
Since ParameciumDB does not have any full-time curators, we interfaced the Apollo genome editor (27) to ParameciumDB so that community members can query an annotation instance of the ParameciumDB Chado database to view regions of the genome with many kinds of evidence and create new genome annotations. Once a month, the new expert annotations tagged ‘finished’ are incorporated into ParameciumDB, with the name of the curator and comments. The latest version of ParameciumDB proteins, including the new curated annotations, then becomes available for download as a FASTA file and can be chosen as the database for Blast searches.
Using wiki media, we recently added the ‘parawiki’ to ParameciumDB for documentation, protocols and nomenclature guidelines (a complete version of the nomenclature guidelines is provided as Supplementary data). The parawiki is also used to provide information and registration for meetings. With permission from Cold Spring Harbor Laboratory Press, the parawiki includes the complete version of short articles about the biological uses of Paramecium that constitute, in an abbreviated form, a chapter in the Emerging Model Organism Series (28). We welcome additional contributions from the community.
ParameciumDB is ready to map and display NGS data, which is already being generated to study gene expression, identify mutations, polish the sequence of the P. tetraurelia reference somatic chromosomes and determine the sequence of the germ line genome. We also plan to develop ParameciumDB as a resource for comparative studies. The high throughput and relatively low cost of NGS are making it possible to consider sequencing the genomes of other Paramecium species, such as the members of the P. aurelia sibling complex. Indeed, Paramecium biaurelia sequencing is already well advanced (M. Lynch, personal communication). ParameciumDB, which uses the same Chado database schema originally designed by FlyBase to prepare for the Drosophila 12 genomes project (29), is ready for incorporation of the genomes of more species. However, a big challenge will be to integrate and/or develop tools for viewing and analyzing synteny within and between genomes. GBrowse_syn (http://gmod.org/wiki/GBrowse_syn) might provide a good solution.
ParameciumDB funding is provided by the Centre National de la Recherche Scientifique and the Agence Nationale de la Recherche (project ParaDice ANR-08-BLAN-0233). Funding for open access charge: Agence Nationale de la Recherche (ANR).
Conflict of interest statement. None declared.
Supplementary Data are available at NAR Online.
The authors would like to thank the GMOD community for the software components used by ParameciumDB and support. They are indebted to Mark Gibson, Jonathan Crabtree and Scott Cain for help with the interface of Apollo to ParameciumDB. They thank all members of the Cohen and Bétermier labs at the Centre de Génétique Moléculaire for testing and feedback and the Paramecium research community for contributed data and suggestions for improvements. ParameciumDB is developed in the context of the CNRS European Research Group ‘Paramecium Genome Dynamics and Evolution’.