WormBase (http://www.wormbase.org), the model organism database for information about Caenorhabditis elegans and related nematodes, continues to expand in breadth and depth. Over the past year, WormBase has added multiple large-scale datasets including SAGE, interactome, 3D protein structure datasets and NCBI KOGs. To accommodate this growth, the International WormBase Consortium has improved the user interface by adding new features to aid in navigation, visualization of large-scale datasets, advanced searching and data mining. Internally, we have restructured the database models to rationalize the representation of genes and to prepare the system to accept the genome sequences of three additional Caenorhabditis species over the coming year.
WormBase is the model organism database for the biology and genomics of Caenorhabditis elegans and Caenorhabditis briggsae. It is a rapidly evolving resource: C.elegans is widely used as a model organism for a variety of biomedical research topics, including development, neuroscience, apoptosis and aging (1–4), and an increasingly wide range of high-throughput data is available for it. The genome sequence of C.elegans (5) has boosted genome-wide research projects including ORFeome (6), RNA interference (RNAi) (7), microarray (8), interactome (genome-wide protein–protein interactions) (9), serial analysis of gene expression (SAGE) (10,11) and other gene expression profiling techniques (11). These large-scale datasets have enormously enriched WormBase content (2,3). More recently, the availability of the whole C.briggsae genome sequence (12), in addition to that of C.elegans, has established WormBase as a platform for comparative genomics within the genus Caenorhabditis (13).
The International WormBase Consortium, consisting of over 30 scientists from four institutions (http://wormbase.org/about/people.html), collects and annotates both large- and small-scale datasets from C.elegans, C.briggsae and related nematodes, organizes them in a single public database, and makes them available for browsing and downloading on the WormBase website. In addition to acquiring directly deposited data by liaison with the research community, the consortium reviews and extracts data from the complete Caenorhabditis published literature. New releases of the database are made available every two weeks, ensuring that new and updated datasets are available to the community on a timely basis. This paper reviews recent progress in WormBase content and improvements in the user interface, explains how WormBase is evolving and discusses different methods of accessing the data. The paper closes with a discussion of new features planned for the coming year.
Over the past year we have greatly increased the sizes of some existing datasets. For example, there is a 5-fold increase in microarray data points and a dramatic 13-fold increase in microarray experiments, from 8 experiments (reported in 2 papers) to 113 experiments (reported in 15 papers). The number of RNAi experiments producing a non-wild-type phenotype has also more than doubled over the past year.
We continue to refine C.elegans gene models on the basis of new data appearing in the literature, from new sequence data in the public nucleotide databases (GenBank/EMBL/DDBJ), and from personal communications from the Worm community. Most curation activity involves refining the structure of existing gene models. However, we also continue to remove gene predictions that are no longer valid (e.g. very short open reading frames) and we continually add new gene predictions where appropriate (usually corresponding to new isoforms of an existing gene). Despite large numbers of genes being created and removed, the total gene count (for protein-coding genes) has seen only a small net increase (+22 genes) over the year. In contrast to this, the proportion of protein-coding genes that are now confirmed by transcript data (i.e. where every coding exon has transcript support) has increased by 20% (from 4663 to 5569) over the same period. This is due to the availability of more transcript data [particularly expressed sequence tags (ESTs)] and the work of curators to refine gene models to better fit the available transcript data. We have also greatly improved the methods by which transcripts are mapped onto the genome and connected to gene models.
Over the same period, WormBase has added several new large-scale experimental and theoretical datasets. Notable additions include large-scale SAGE datasets (10,11), the interactome dataset (9), 3D structural data and the National Center for Biotechnology Information (NCBI) KOGs (14) set of predicted orthologous groups. Recently, the newly developed technique trans-spliced exon coupled RNA end determination (TEC-RED) has been used to assay the 5′ ends of expressed genes in C.elegans (15) and the dataset is being curated and entered into WormBase.
SAGE (10,11) is a sensitive technique for assaying genome-wide gene expression levels that provides a good complement to microarray-based techniques. As of release WS123, WormBase incorporates the results of 12 SAGE libraries, two of which have been published previously (10). The 12 libraries cover various developmental stages (11) from embryo to adult and touch 20417 genes (coding sequences, WS129) corresponding to 91.9% of all genes annotated in the C.elegans genome in WormBase (22213 including alternatively spliced coding sequences, WS129). SAGE tags corresponding to a gene can be found at the bottom of the WormBase gene page (e.g. http://www.wormbase.org/db/gene/gene?name=ced-3#Reagents) and are linked to information detailing the SAGE tag's abundance at various life stages in a new SAGE report page (Figure 1).
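The coverage figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two counts given in the text; the variable names are illustrative.

```python
# Counts quoted in the text for release WS129.
genes_with_sage_tags = 20417   # coding sequences touched by at least one SAGE tag
annotated_genes = 22213        # all coding sequences, including alternative splices

# Fraction of annotated coding sequences covered by the 12 SAGE libraries.
coverage_percent = 100 * genes_with_sage_tags / annotated_genes

print(f"SAGE coverage: {coverage_percent:.1f}%")
```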
Dissecting a protein's interaction network is often a key to understanding its biological role. WormBase includes the results of the ‘Interactome Project’, a large-scale screen based on the yeast two-hybrid (Y2H) technique (9). In the current dataset, baits are biased towards genes either homologous to human genes, of multicellular functions (genes with homologues in multicellular organisms including Drosophila melanogaster, Homo sapiens and Arabidopsis thaliana but not in Saccharomyces cerevisiae), or having a known role in mitosis and meiosis. Currently, WormBase includes 5534 interactions covering 15% of the C.elegans proteome. Users can view these interactions from the gene summary page.
The 3D protein structure dataset, small but important, comes from the Northeast Structural Genomics Consortium (http://www.nesg.org), which aims to produce 340 C.elegans targets. The primary targets of the Consortium are proteins of eukaryotic model organisms, including S.cerevisiae and D.melanogaster in addition to C.elegans. Currently, structures for six proteins have been deposited in the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/) (16). Detailed information about the status of these 340 C.elegans targets has been included in WormBase and will be regularly updated.
KOGs are a eukaryote-specific version of the Clusters of Orthologous Groups (COGs) originally devised at the NCBI for microbial genomes (14). KOGs are defined by a triangle of reciprocal best BLASTP hits between domains of eukaryote proteins from highly divergent species (14). Over the last year, WormBase has incorporated these KOG annotations, together with other homology groups (14). Currently, WormBase carries 4852 KOGs, which include the products of 9427 C.elegans protein-coding genes (i.e. 48% of all predicted protein-coding genes in WS129).
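The "triangle of reciprocal best hits" idea can be illustrated with a toy example. The sketch below is not the NCBI procedure, which works on protein domains and merges triangles into larger groups; it only shows the core test on whole proteins. The protein names, species labels and the best_hit table are all invented for illustration.

```python
from itertools import combinations

# Hypothetical toy data: which species each protein belongs to.
species_of = {"ce1": "worm", "dm1": "fly", "hs1": "human", "hs2": "human"}

# Hypothetical best BLASTP hit of a query protein in a target species:
# (query protein, target species) -> best-hit protein.
best_hit = {
    ("ce1", "fly"): "dm1", ("dm1", "worm"): "ce1",
    ("ce1", "human"): "hs1", ("hs1", "worm"): "ce1",
    ("dm1", "human"): "hs1", ("hs1", "fly"): "dm1",
    ("hs2", "worm"): "ce1", ("hs2", "fly"): "dm1",  # one-way hits only
}

def reciprocal(a, b, best_hit, species_of):
    """True if a and b are each other's best hit across their two species."""
    return (best_hit.get((a, species_of[b])) == b
            and best_hit.get((b, species_of[a])) == a)

def find_triangles(best_hit, species_of):
    """Return protein triples from three species that are pairwise reciprocal best hits."""
    triangles = []
    for a, b, c in combinations(sorted(species_of), 3):
        if len({species_of[a], species_of[b], species_of[c]}) < 3:
            continue  # a triangle needs three different species
        if all(reciprocal(x, y, best_hit, species_of)
               for x, y in ((a, b), (a, c), (b, c))):
            triangles.append((a, b, c))
    return triangles

print(find_triangles(best_hit, species_of))
```

In this toy table, hs2 finds ce1 and dm1 as its best hits but is not their best hit in return, so it does not join the triangle.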
The backend database of WormBase is ACeDB (http://www.acedb.org) (4). During the last year, we have changed the way that a number of data types are represented in the database. These changes to the database schema do not affect most users. However, advanced users who write scripts to access WormBase need to be aware of them. Significant model changes include the introduction of a unified Gene class (http://wormbase.org/db/misc/model?class=Gene), which holds all relevant information about a gene. Previously, such information was scattered among several interrelated classes. At the same time we have introduced CDS and Transcript classes to better manage the relationships between spliced transcripts and their products, and have significantly improved the derivation of transcript structures from cDNA and EST sequences.
Alongside these changes we have introduced stable anonymous identifiers for genes, of the form WBGene00006741, and for papers, of the form WBPaper0005637, following the pattern already used for person identifiers such as WBPerson241. These identifiers track the various names that have been used for the corresponding entity and should be used where possible for database cross-referencing. The website supports URLs of the form http://www.wormbase.org/db/get?name=WBGene00006741;class=Gene. Questions about data models can be directed to wormbase-help@wormbase.org.
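For script writers, the identifier and URL conventions described above can be sketched as follows. This is a minimal illustration, not WormBase code: only the Gene class name is confirmed by the example URL in the text, so the Paper and Person class names are assumptions, and the pattern deliberately does not fix the number of digits because the examples in the text differ in length.

```python
import re

# Identifier prefix -> database class name.
# "Gene" appears in the URL quoted in the text; "Paper" and "Person"
# are assumed class names used here for illustration only.
ID_CLASSES = {"WBGene": "Gene", "WBPaper": "Paper", "WBPerson": "Person"}

# A prefix followed by one or more digits (digit count left open on purpose).
ID_PATTERN = re.compile(r"^(WBGene|WBPaper|WBPerson)(\d+)$")

def classify_id(identifier):
    """Return the class name for a stable identifier, or None if it is not one."""
    match = ID_PATTERN.match(identifier)
    return ID_CLASSES[match.group(1)] if match else None

def get_url(identifier, base="http://www.wormbase.org/db/get"):
    """Build a URL of the form shown in the text for a stable identifier."""
    cls = classify_id(identifier)
    if cls is None:
        raise ValueError(f"not a stable identifier: {identifier!r}")
    return f"{base}?name={identifier};class={cls}"

print(get_url("WBGene00006741"))
```

A gene name such as ced-3 is rejected by classify_id, which is the point of the scheme: names may change over time, whereas the stable identifier is meant to stay fixed.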
The genome browser is a central component of WormBase that allows users to visualize gene model structures and their supporting evidence, as well as other features such as single nucleotide polymorphisms (SNPs), repetitive elements and experimental reagents. Over the last year, the browser has been enhanced in several ways: (i) scalable vector graphics (SVG) support. WormBase genome browser images have been widely used in presentations and publication illustrations (2,3,17), but their bitmapped nature leads to image degradation when printed at high resolution. We have recently added a facility that allows WormBase users to download specified genome browser images as SVG files (http://www.w3.org/TR/SVG/), which can be displayed, edited and printed at high resolution using SVG compatible software such as Adobe Illustrator 10. (ii) Feature highlighting. To assist location and visualization of features of interest, WormBase now highlights with a yellow background the feature that users have found in a search. This change is especially useful when users browse in large window size with multiple tracks turned on. (iii) Untranslated regions (UTRs). Both the internal data model and the visual display have now been modified to show the untranslated sections of transcripts, as well as internal splices that occur within the 5′- or 3′-UTRs. (iv) More feature tracks, including SNPs, SAGE tags, operon, poly(A) sites and predicted signal sequences. (v) DAS support. The genome browser may now be used as a viewer for Distributed Annotation System (DAS) (18) tracks, allowing users to superimpose their own annotations on WormBase tracks.
WormBase now maintains nucleotide-level alignments of ESTs, cDNAs and other sequences both within and between species. For example, the alignment between the C.elegans and C.briggsae genomes can be viewed both in a low-resolution view that emphasizes the relationship among a group of colinear genes (http://www.wormbase.org/db/seq/ebsyn?name=cb25.fpc0143:1..8000), or in a high-resolution text alignment view that shows differences in individual nucleotides. ESTs and cDNAs from C.elegans and other nematodes can be viewed in a multiple alignment view that highlights misalignments and gaps (http://www.wormbase.org/db/seq/aligner?name=WBGene00000423;class=Gene).
At the protein level, WormBase maintains a list of best BLAST matches to longest protein products from other important species including human (H.sapiens), mouse (Mus musculus), rat (Rattus norvegicus), fly (D.melanogaster), yeast (S.cerevisiae) and C.briggsae, which together can provide insights into the function of the related genes. All BLAST results are hyperlinked to a relevant entry in the respective model organism database or to Swiss-Prot/TrEMBL as appropriate. The multiple alignment display highlights conserved amino acid residues using a color code based on the chemical properties of the residues (Figure 2).
Over the past year, we have added a WormBase site map (http://wormbase.org/db/misc/site_map) to provide an overview of the increasing number of web pages. Users can access this map directly from the navigation banner at the top of every WormBase page. The site map page lists all WormBase pages and provides users with different views. For example, ‘Detailed View’ gives a brief overview of each page before the user visits it, and ‘Alphabetical View’ lists search pages in alphabetical order. Recently, WormBase has established a glossary page (http://dev.wormbase.org/db/misc/glossary) that lists definitions of common terms used throughout the site.
As biologists come to make more sophisticated use of large-scale datasets, there is an increasing need for a resource that is more than a point-and-click repository but provides data analysis and mining tools as well. This section briefly describes existing and recently introduced features that make WormBase suitable for data mining.
There are five different methods for accessing WormBase, each one suitable for a different set of purposes. Users can choose the most appropriate access methods according to their experience and needs.
As a sequence analysis platform, WormBase has made a large number of sequence analysis tools available to users. These tools include BLAST (20), BLAT (21), ePCR (22), coordinate mapper, EST aligner and protein aligner. In the past year, two new data mining tools have also been added to WormBase: Textpresso (http://www.textpresso.org) (23), a literature search tool, and CisOrtho (24), a comparative cis-element search tool. Textpresso is a full-text search engine that lets researchers search the body of all WormBase literature holdings, which include a substantial percentage of the C.elegans and C.briggsae literature. Currently, the Textpresso database holds 19985 curated documents, 4420 of which have full texts. These documents come from four major sources: (i) CGC papers, which are scientific journal articles maintained by the Caenorhabditis Genetics Center (http://biosci.umn.edu/CGC/CGChomepage.htm); (ii) Worm Meetings abstracts; (iii) Worm Breeders Gazette abstracts; and (iv) miscellaneous documents, which are various other abstracts containing data about C.elegans and C.briggsae. Another useful feature of Textpresso is that it returns the sentences that contain the keywords, with links to WormBase paper pages and PubMed pages.
CisOrtho (24) works by starting from a consensus binding site that is represented as a weight matrix. It identifies potential sites in a pre-filtered genome and then further filters by assessing conservation of the putative site in the genome of a related species, a process called phylogenetic footprinting. CisOrtho can be accessed at http://www.wormbase.org/cisortho/.
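The two steps described above, weight-matrix scanning followed by a conservation filter, can be illustrated with a toy sketch. This is not the CisOrtho implementation: the matrix values, threshold and sequences are invented, input is assumed to be uppercase A/C/G/T, and "conserved" is simplified to mean that the corresponding region from the related species also contains at least one site above the threshold.

```python
# Hypothetical weight matrix for a 4-base site with consensus TGAC:
# +2 for the consensus base at each position, -1 for any other base.
CONSENSUS = "TGAC"
MATRIX = [{base: (2 if base == want else -1) for base in "ACGT"}
          for want in CONSENSUS]

def score_site(site, matrix):
    """Sum the per-position weights for one candidate site."""
    return sum(matrix[i][base] for i, base in enumerate(site))

def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        score = score_site(site, matrix)
        if score >= threshold:
            hits.append((start, site, score))
    return hits

def conserved_hits(region, ortholog_region, matrix, threshold):
    """Keep hits in region only if the orthologous region also has a hit."""
    if not scan(ortholog_region, matrix, threshold):
        return []
    return scan(region, matrix, threshold)

# Invented upstream regions from two related species.
elegans_region = "AATGACTT"
briggsae_region = "CCTGACGG"

print(conserved_hits(elegans_region, briggsae_region, MATRIX, threshold=6))
```

The conservation step is what reduces false positives: a short motif matches by chance many times in a genome, but a chance match is less likely to recur in the corresponding region of a related species.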
In the past, the WormBase fortnightly update policy presented a problem to researchers who published results based on mining WormBase because by the time their results were published the version of WormBase they based their analysis on had been superseded. To assist in making such research citable and reproducible, we have adopted a new policy in which every tenth WormBase release becomes a frozen release. Frozen releases are available in perpetuity on specially designated WormBase sites named http://ws100.wormbase.org, http://ws110.wormbase.org and so on. The first freeze was http://ws100.wormbase.org, released on May 10, 2003. The most recent freeze is http://ws130.wormbase.org, released on August 16, 2004. Researchers are encouraged to perform large-scale analyses on a frozen release and to cite the release number in their publications. Pointers to all freezes are displayed on the WormBase live site front page.
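The freeze policy lends itself to a small helper for scripts that need to pin an analysis to a citable release. The sketch below assumes, based on the examples in the text (ws100, ws110, ws130), that frozen releases are exactly the multiples of ten starting at WS100; the function names are illustrative.

```python
FIRST_FREEZE = 100  # WS100 is the first frozen release named in the text

def is_frozen(release):
    """True if the release number is a frozen release (assumed: multiple of 10, >= 100)."""
    return release >= FIRST_FREEZE and release % 10 == 0

def latest_freeze_at_or_before(release):
    """Return the most recent frozen release number not later than `release`."""
    if release < FIRST_FREEZE:
        raise ValueError(f"no frozen release exists at or before WS{release}")
    return release - release % 10

def freeze_url(release):
    """Build the site URL for a frozen release, following the pattern in the text."""
    if not is_frozen(release):
        raise ValueError(f"WS{release} is not a frozen release")
    return f"http://ws{release}.wormbase.org"

# An analysis started against WS129 could be rerun on, and cite, WS120.
print(freeze_url(latest_freeze_at_or_before(129)))
```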
WormBase is a part of the GMOD project (25,26), a broad collaboration among the model organism databases to develop common vocabularies, data models, software tools and user interfaces applicable across all model organism community databases. As part of this project, WormBase provides sequence-similarity-based links between its gene pages and the gene pages of FlyBase (27), The Saccharomyces Genome Database (28,29), Ensembl (29) and Reactome (http://www.reactome.org). Links to RGD (30) and MGD (31) are planned.
Recently, the GMOD project has developed a common representation of genomic sequence features known as the Sequence Ontology (http://song.sourceforge.net), which facilitates exchange of genomic annotations among the various MODs and encourages the use of common analytic and visualization tools. GMOD participants are already using common software packages on their websites for visualizing genome annotations, drawing genetic maps and searching the literature, and this convergence will be enhanced in the near future as the MODs move towards a unified gene page.
WormBase has evolved from ACeDB (http://www.acedb.org), to a database that encompasses literature curation and the biology of C.elegans (4), and recently to a database housing the biology and genomic data of multiple nematode species (2,3). WormBase is still a work in progress. On the user interface front, future enhancements include WormMart, which is based on BioMart, an advanced query and report generation system first developed for use with Ensembl (32). On the data front, we are looking forward to the genome sequencing and annotation of three more nematode species (http://genome.gov/page.cfm?pageID=10002154), bringing to five the number of Caenorhabditis genomes maintained by WormBase. During 2005, WormBase plans to introduce a browser for nematode intermediate metabolism and higher-order biological pathways. The pathway browser and the underlying dataset will be developed in collaboration with the Reactome and MetaCyc (http://metacyc.org/) (33) projects. Together these will provide an unparalleled resource for dissecting functional elements in the Caenorhabditis genomes and provide valuable insights into the evolution and biological adaptations of these organisms.
The WormBase Consortium will continue to address issues raised by WormBase users, maintaining both a simple and friendly user interface while adding further search and research tools to enable WormBase's evolution from a data repository into a resource for all biologists to use in order to maximize the value of model organism research in C.elegans and its relatives.
As always, we welcome comments, questions, corrections and data submissions (wormbase-help@wormbase.org).
P.W.S. is an Investigator with the Howard Hughes Medical Institute. We thank Sheldon McKay and Kris Gunsalus for critical reading of the manuscript. WormBase is supported by grant P41-HG02223 from the US National Human Genome Research Institute and the British Medical Research Council.