Active import and curation of new types of C. elegans data continues to be one of the primary activities in the maintenance and development of WormBase. The past 2 years have seen the incorporation of modENCODE (8) data along with other large-scale data sets; the development of a Worm Phenotype Ontology [WPO; (16)]; adaptation of Serial Patterns of Expression Levels Locator [SPELL; (17)] to house microarray data; and the incorporation of new data classes such as molecules, images and human disease connections. We discuss these data types below.
modENCODE data were added to the primary C. elegans Genome Browser in June 2010; curators are using these data for sequence curation and have devised strategies to integrate them into WormBase. modENCODE data sets include UTRome features, pseudogene curation targets, Highly Occupied Target (HOT) regions, polyA sites, ncRNA genes and aggregate coding gene models. These data sets have been subjected to rigorous internal quality control and fully integrated into the database.
Gene model curation
WormBase continues to maintain a manual gene curation program whereby gene structures are corrected in line with all currently available data for a given locus. This is managed and streamlined via the Sequence Curation Tool (CT), an in-house software suite [see below; (18)]. The integration of large data sets such as modENCODE has provided valuable extra evidence for gene model curation. RNASeq data from modENCODE have been used to discover anomalies that highlight potential cases where adjacent genes could be merged. Resolving these anomalies alone has so far resulted in the improvement of over 100 gene models.
Representation of miRNAs has been rationalized and extended so that there is now a clear distinction between mature miRNA products and primary transcripts. Integration of additional large datasets included polyA sites generated by a project not associated with modENCODE (19). Combining these with the modENCODE data has resulted in the assignment of polyA sites to >80% of coding genes. genBlastG (20) gene models for C. briggsae, C. brenneri and C. remanei have also been incorporated into the database. These gene models were computed by projection of C. elegans gene models, and have been helpful for the curation of these genomes.
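The combination of polyA sites from several sources and their assignment to coding genes can be illustrated with a minimal Python sketch. It is not the WormBase pipeline: the `Gene` record, the gene names, the coordinates and the rule that a site belongs to a gene when it falls within the gene body or 500 bp downstream of its 3' end are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def three_prime_window(gene, flank=500):
    """Assumed rule: gene body plus `flank` bases downstream of the 3' end."""
    if gene.strand == "+":
        return gene.start, gene.end + flank
    return max(1, gene.start - flank), gene.end


def assign_polya_sites(genes, sites, flank=500):
    """Map each gene name to the sorted polyA positions inside its window.

    `sites` holds (chrom, position, strand) tuples pooled from all sources;
    duplicates reported by more than one source are collapsed.
    """
    assigned = {g.name: set() for g in genes}
    for chrom, pos, strand in sites:
        for g in genes:
            if g.chrom != chrom or g.strand != strand:
                continue
            lo, hi = three_prime_window(g, flank)
            if lo <= pos <= hi:
                assigned[g.name].add(pos)
    return {name: sorted(positions) for name, positions in assigned.items()}


def fraction_with_site(assigned):
    """Fraction of genes that received at least one polyA site."""
    if not assigned:
        return 0.0
    return sum(1 for p in assigned.values() if p) / len(assigned)


# Hypothetical genes and two hypothetical site collections.
genes = [
    Gene("gene-1", "I", 1000, 2000, "+"),
    Gene("gene-2", "I", 5000, 6000, "-"),
    Gene("gene-3", "II", 100, 900, "+"),
]
source_a = [("I", 2100, "+")]
source_b = [("I", 4800, "-"), ("I", 2100, "+")]

assigned = assign_polya_sites(genes, source_a + source_b)
coverage = fraction_with_site(assigned)
```

Pooling the sources before assignment means that a site reported by both collections counts once, and `coverage` gives the kind of summary figure quoted above (here two of three hypothetical genes).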
Whole genome sequencing data
One of the key challenges faced by WormBase is the rapid growth of C. elegans strain variation data generated by Whole Genome Sequencing (WGS) projects. The strains from which these data sets are derived vary, ranging from wild isolates to laboratory-manipulated mutants. We continue to investigate and develop mechanisms for the efficient storage, processing and visualization of these data sets. The acknowledged canonical resource for the management and archiving of variation data is dbSNP (21). We strongly encourage projects to submit their data to dbSNP, and continue to act as a submission broker in cases where a laboratory lacks the technical resources to conform to the dbSNP submission protocols. While dbSNP acts as the primary repository for the data, WormBase adds curated and computationally derived value, for example putative gene consequence, and provides full cross-referencing back to the dbSNP primary records. To date, WGS data from six projects (one ongoing) have been integrated into WormBase and submitted to dbSNP [Andersen et al., manuscript in preparation; Moerman and Waterston, manuscript in preparation; (22–25)]. This amounts to a total of about 400
Worm phenotype ontology
We have continued to develop the WPO and have added 115 new phenotype terms this past year, bringing the total number of terms to 1985. New terms are added in parallel to the curation process, allowing us to remain up-to-date with the field. The WPO was published as a resource for the scientific community (16). Currently, the Biological General Repository for Interaction Datasets [BioGrid; http://thebiogrid.org] database uses the WPO to annotate phenotypes associated with genetic interactions in C. elegans.
All C. elegans-related microarray datasets from Gene Expression Omnibus [GEO; (27)] and ArrayExpress (28) have been imported into WormBase. Probe-centric microarray data are mapped to the latest version of the C. elegans genome for each WormBase release to generate gene-centric data, which are stored in a MySQL-based SPELL database [http://spell.caltech.edu:3000/]. These displays also include expression levels from RNAseq datasets.
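The conversion from probe-centric to gene-centric data can be sketched as follows. This is a simplified illustration, not the WormBase or SPELL code: the probe and gene identifiers are hypothetical, and averaging the probes that map to the same gene is an assumed summary rule that the text does not specify.

```python
from collections import defaultdict
from statistics import mean


def probes_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level values into gene-level values.

    probe_values:  {probe_id: expression value} for one experiment
    probe_to_gene: {probe_id: gene_id}, rebuilt for each genome release
    Probes with no gene in the current mapping are skipped.
    """
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            per_gene[gene].append(value)
    # Assumed rule: a gene's value is the mean of its probes.
    return {gene: mean(values) for gene, values in per_gene.items()}


# Hypothetical probe measurements and a hypothetical release-specific mapping.
probe_values = {"probe_A": 2.0, "probe_B": 4.0, "probe_C": 1.5, "probe_D": 9.0}
probe_to_gene = {"probe_A": "gene-1", "probe_B": "gene-1", "probe_C": "gene-2"}

gene_values = probes_to_genes(probe_values, probe_to_gene)
```

Keeping the probe-to-gene mapping separate from the measurements lets the same probe-level data be re-summarized whenever gene models change between releases.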
We are now extracting published images from expression pattern analyses and will expand this curation to include images of other data types. To make the process more efficient, effort has been devoted to automating image acquisition. To display published images, permission for each individual image has to be obtained from the publisher. To date, permission has been obtained from 27 major publishers and WormBase is negotiating with several others. We are also working on automating the process of requesting permission. Before this project began, 7228 images were directly submitted by a small number of laboratories engaged in large-scale projects. These images will be added to over 2000 images now extracted from the literature. Each image is manually curated and associated with a gene, anatomical structure and cellular component.
Molecule curation captures small molecules and drugs that modify or cause phenotypes in a mutant background or RNAi-based experiments, and/or cause changes in gene-regulation activity. This data class has been populated with molecules from ChEBI (http://www.ebi.ac.uk/chebi/), the National Library of Medicine (http://www.nlm.nih.gov/mesh/MBrowser.html), the Comparative Toxicogenomics Database (CTD; http://ctd.mdibl.org/) and Small Molecule Metabolite (http://www.SMMID.org), which act as sources of IDs, names and synonyms for assigning molecule annotations to WormBase data. Over 600 molecule connections to gene and RNAi and variation phenotype objects have been created since curation of this data type began.
Human disease gene orthologs
WormBase provides curated, concise descriptions of genes based on the reading of published literature. These are free-text and include information about gene orthology, function and expression. Since C. elegans is an important animal model that is increasingly used for the study of human disease, we write these gene descriptions with emphasis on the orthologies to human disease genes, and how their study in C. elegans has informed the disease field. This information will be highlighted with a special ‘Human disease relevance’ tag, for the benefit of both the C. elegans and non-C. elegans researcher. We plan to facilitate queries to serve as a portal through which one can access relevant information from the nematode field; for example, a query using either a human gene name or a disease name will lead the user to the relevant C. elegans gene.
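The planned query behaviour, in which either a human gene name or a disease name leads to the relevant C. elegans gene, could be modelled with a simple lookup index. The sketch below is an assumption about how such a portal might work, not a description of the WormBase implementation; all gene and disease names are placeholders rather than real annotations.

```python
from collections import defaultdict


def build_index(records):
    """Index worm genes under both the human gene name and the disease name.

    records: iterable of (worm_gene, human_gene, disease) tuples.
    Keys are case-folded so that queries are case-insensitive.
    """
    index = defaultdict(set)
    for worm_gene, human_gene, disease in records:
        index[human_gene.casefold()].add(worm_gene)
        index[disease.casefold()].add(worm_gene)
    return index


def find_worm_genes(index, query):
    """Return the worm genes linked to a human gene name or disease name."""
    return sorted(index.get(query.strip().casefold(), set()))


# Placeholder records; none of these names is a real annotation.
records = [
    ("worm-gene-1", "HUMAN_GENE_A", "example disease X"),
    ("worm-gene-2", "HUMAN_GENE_B", "example disease X"),
    ("worm-gene-3", "HUMAN_GENE_C", "example disease Y"),
]
index = build_index(records)

by_gene = find_worm_genes(index, "human_gene_a")
by_disease = find_worm_genes(index, "Example Disease X")
```

A single index keyed on both name types keeps the query interface uniform: the user does not need to state whether the search term is a gene or a disease.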