GeneDB (http://www.genedb.org/) is a genome database for prokaryotic and eukaryotic organisms. The resource provides a portal through which data generated by the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute and other collaborating sequencing centres can be made publicly available. It combines data from finished and ongoing genome and expressed sequence tag (EST) projects with curated annotation that can be searched, sorted and downloaded through a single web-based resource. The current release stores 11 datasets, of which six are curated and maintained by biologists who review and incorporate information from the scientific literature, public databases and the respective research communities.
The Pathogen Sequencing Unit (PSU) at the Wellcome Trust Sanger Institute sequences a large number of diverse prokaryotic and eukaryotic genomes (http://www.sanger.ac.uk/Projects/). In recent years, new sequencing and assembly technologies and collaborations between sequencing institutes have dramatically increased both the output and quality of sequence data. Maintenance and dissemination of such data required the development of an integrated, publicly accessible database.
During GeneDB’s development, four key points have been taken into consideration. First, GeneDB must be capable of storing and frequently updating sequences and annotations, irrespective of the status of the sequencing project. The resource should therefore support both the mining of preliminary datasets for gene discovery and the viewing of finished sequence data. Secondly, an intuitive user interface, which provides rapid access, visualization, searching and downloading of data, must be shared between the datasets. Thirdly, the database architecture should allow integration of diverse biological datasets with the sequence. Lastly, the use of structured vocabularies should ensure standardization, facilitating querying and comparison between species.
Currently, GeneDB houses the sequences and associated annotation of 11 organisms, including members of the bacteria, fungi, protozoa and arthropods. Of these, six are finished genomes and five are ongoing sequencing projects (Table 1).
GeneDB stores sequence data and analyses generated via automated annotation pipelines, prior to manual annotation. The analysis pipelines include gene finding algorithms, protein feature predictions, BLAST and/or FASTA searches against nucleotide, protein and customized databases, protein domain and/or family search results, and electronically inferred and manually revised Gene Ontology (GO) associations (1). Search results are continually reviewed during the curation process and complemented by additional datasets (Fig. 1).
The sequence and annotation files are processed by the GeneDB mining code, generating both the GeneDB Java objects and standardized files used to populate a Genomics Unified Schema (GUS) database (http://www.gusdb.org/). A set of data files including FASTA sequence files for third-party tools, such as BLAST, is also produced. Both the mining code and the GeneDB object layer take advantage of available code from the BioJava project. Access to the GeneDB data through the GeneDB website is provided by a set of servlets and Java Server Pages (JSP).
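As a rough illustration of the kind of data file mentioned above, the following Python sketch formats sequence records as FASTA text suitable for a third-party tool such as BLAST. It is not the GeneDB mining code, which is Java-based and builds on BioJava; the function name `to_fasta`, the record layout and the 60-column line width are assumptions made for this example.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) tuples as FASTA text.

    The tuple layout and the default line width are illustrative
    assumptions, not a description of the GeneDB file formats.
    """
    lines = []
    for identifier, description, sequence in records:
        # Header line: '>' followed by the identifier and a free-text description.
        lines.append(f">{identifier} {description}".rstrip())
        # Sequence lines are split into fixed-width chunks.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Calling `to_fasta([("gene1", "hypothetical protein", "MKT...")])` with a full sequence string returns one header line followed by the sequence wrapped at 60 characters per line.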
The GeneDB homepage supplies links to the individual organism homepages. From these, researchers can take advantage of numerous ways to retrieve data and construct searches according to individual preferences and requirements. Clickable chromosome and contig maps, searchable text indices and browsable catalogues [GO assignments (1), descriptions, products, domains] provide fast and easy access. An additional query interface supports a wide range of queries on sequences and (curated) annotations stored in GUS, with the ability to combine searches with the Boolean operators AND and OR. For example, users can select all proteins of a specified length range with a specified number of introns. Other query options include GO assignments, keywords, chromosome, protein domains and predicted protein sequence features. The queries in each session are tracked via a history page, allowing further refinement of searches and downloading of results as a nucleotide or amino acid FASTA file. Furthermore, a variety of sequence similarity search facilities are available through GeneDB. In addition to WU-BLAST, GeneDB also supports omniBLAST, which permits searching across a set of selectable databases. An iterative BLAST (PSI-BLAST) search suited to the identification of distant homologues is expected to be available shortly. Peptide sequences can be searched either with user-specified motifs or with the peptide mass identification tool EMOWSE, part of the suite of EMBOSS open-source software tools (2). An alternative approach for accessing genes of interest is to use the official browser of the GO consortium, AmiGO. Several different methods are available for querying the data both externally (http://www.godatabase.org/) and internally via GeneDB, all of which include direct links to the gene pages.
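The combination of searches with AND and OR, together with a per-session history, can be sketched in Python as composable filters over gene records. This is a minimal illustration only: the `Gene` record, its field names, the helper functions and the history list are hypothetical and do not reflect the GUS schema or the GeneDB servlet code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    # Hypothetical record; field names are assumptions for this sketch.
    identifier: str
    protein_length: int
    intron_count: int
    go_terms: frozenset


def length_between(low, high):
    """Match proteins whose length lies in the inclusive range [low, high]."""
    return lambda gene: low <= gene.protein_length <= high


def introns_equal(count):
    """Match genes with exactly `count` introns."""
    return lambda gene: gene.intron_count == count


def has_go(term):
    """Match genes annotated with the given GO identifier."""
    return lambda gene: term in gene.go_terms


def all_of(*predicates):
    """Boolean AND over any number of filters."""
    return lambda gene: all(p(gene) for p in predicates)


def any_of(*predicates):
    """Boolean OR over any number of filters."""
    return lambda gene: any(p(gene) for p in predicates)


def run_query(genes, predicate, history):
    """Return matching identifiers and record the result in the session history."""
    result = [gene.identifier for gene in genes if predicate(gene)]
    history.append(result)
    return result
```

Under these assumptions, the example from the text (proteins of a specified length range with a specified number of introns) becomes `all_of(length_between(300, 500), introns_equal(2))`, and an earlier query can be refined by wrapping it in `any_of` or `all_of` together with a further filter such as `has_go(...)`.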
Feature pages, generated for coding sequences, display basic location information and a context map. The results of protein feature prediction algorithms [SignalP V2.0 (3), TMHMM v2.0 (4), GPI anchor predictions (http://184.108.40.206/dgpi/index_en.html)] and the manual annotation and curation processes are provided in both a graphical display and text format (Fig. 2). This information is complemented by the results of similarity searches, including the display of predicted and experimentally characterized orthologues and paralogues. Additional sequence features, both at the DNA level (e.g. polymorphisms, introns, UTRs, splice donor and acceptor sequences) and protein level (e.g. peptide domains), can be viewed in the context of the annotated sequence via an Artemis applet (5) (Fig. 2). The selected region can also be downloaded either in FASTA or annotated EMBL file format. Sequence data, either of the predicted coding sequence or the clustered ESTs, are accessible via a secondary page.
Extensive cross-referencing supports retrieval of related information from external resources, allowing rapid transfer between databases. This includes reciprocal links to numerous databases housing nucleotide and protein sequences [e.g. EMBL (6), Swiss-Prot/TrEMBL (7)], pathways [KEGG (8)], protein families [e.g. SCOP (9), Pfam (10), InterPro (11)], ontologies [e.g. GO (1)], expression data [e.g. microarray (http://www.sanger.ac.uk/perl/SPGE/geexview)], strain information [FYSSION (http://pombe.biols.susx.ac.uk)] and phenotype data [e.g. the Trypanosoma brucei RNAi project (http://www.TrypanoFAN.org/)]. Links to databases housing the same genome at different sites [e.g. SGD (12), TGAD (http://www.tigr.org/tdb/e2k1/tba1/tba1.shtml)] are also provided. These links to external resources are validated and updated on a monthly basis by the GeneDB mining code. Annotators and curators are automatically alerted to inconsistencies in the datasets and changed GO identifiers.
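One way such an alert on changed GO identifiers could work is sketched below in Python. The inputs (a mapping of genes to GO identifiers, the set of identifiers in the current ontology release, and a mapping from superseded to replacement identifiers) and the function name are assumptions for illustration; the source does not describe how the GeneDB mining code performs this check.

```python
def find_changed_go_ids(annotations, current_ids, replaced_by):
    """Report GO identifiers in the annotations that are no longer current.

    annotations: dict mapping gene identifier -> iterable of GO identifiers
    current_ids: set of identifiers present in the current ontology release
    replaced_by: dict mapping a superseded identifier -> its replacement
    All three structures are hypothetical inputs for this sketch.
    """
    alerts = []
    for gene, go_ids in sorted(annotations.items()):
        for go_id in sorted(go_ids):
            if go_id in current_ids:
                continue  # identifier is still valid, nothing to report
            if go_id in replaced_by:
                alerts.append((gene, go_id, f"replaced by {replaced_by[go_id]}"))
            else:
                alerts.append((gene, go_id, "not found in current ontology"))
    return alerts
```

Each returned tuple names the gene, the outdated identifier and a short message, which could then be passed to whichever notification mechanism the curators use.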
Experienced biologists curate data for six of the organisms in GeneDB (Table 1). Such curation involves several aspects, all aiming to facilitate data querying and retrieval. First, as a number of organisms are sequenced by more than one sequencing centre, it ensures consistent annotation across the whole of the respective genome. Secondly, sequences and their annotation are updated according to new submissions to public databases, publications and contributions by the wider scientific community (Fig. 1). Public information is used not only to verify existing gene models and annotations but also to add value, enabling users to retrieve groups of genes/proteins that could not be retrieved by purely computational methodologies. Wherever possible, controlled vocabularies such as GO (1) are used. In the absence of such vocabularies, statements are captured in structured syntax either in the description lines (Fig. 2) or in a dedicated curation field, providing a concise summary of the major aspects of a gene’s biology. Links are provided to PubMed records and other resources used to compile the statements. Text indices point to other products sharing the same description line.
Curators regularly exchange information and updates with the public databases, aiming to synchronize these datasets globally. Finally, the curation of related species (e.g. Plasmodium species, the Kinetoplastida) at one site enables extensive cross-referencing and comparative analyses between species, such as the inclusion of experimentally verified and predicted orthologues.
For three organisms (Schizosaccharomyces pombe, Leishmania major and Trypanosoma brucei), GeneDB curators are also involved in implementing nomenclature guidelines (13) and resolving nomenclature conflicts, ensuring accurate and complete retrieval of information.
GeneDB code development will continue to concentrate on integrating GeneDB with the GUS schema. Part of this development is a collaboration with the GUS team at the Computational Biology and Informatics Laboratory (CBIL, University of Pennsylvania) to design a common web interface architecture, permitting the creation of customized web pages to suit individual database requirements [e.g. GeneDB, PlasmoDB (14), AllGenes (http://www.allgenes.org/)].
Curation will be extended to integrate expression, phenotypic and interaction data. To this end, GeneDB curators and developers have collaborated with the GUS team to implement modifications to the GUS schema in preparation for the incorporation of these large-scale biological data. Also, with the increasing emphasis on genomics projects of related organisms, the GeneDB team are already designing tools for comparative analyses whose results can be readily displayed via the web.
Furthermore, it is intended to substantially expand the available bacterial datasets to include all the bacterial genomes completed and published by the PSU (see http://www.sanger.ac.uk/Projects/Microbes/ for a comprehensive list).
We would like to thank the CBIL team, in particular Jonathan Crabtree, Steve Fischer, Jonathan Schug and Chris Stoeckert. We would also like to thank collaborating sequencing centres, in particular, Najib El-Sayed at The Institute for Genomic Research, Peter Myler and Ken Stuart at the Seattle Biomedical Research Institute, Adam Kuspa at Baylor College of Medicine, Angelika Noegel at the University of Cologne, Michel Veron at the Institut Pasteur, and Mike Lehane at the University of Wales, Bangor for sharing unpublished data as well as the numerous researchers who have contributed to the annotation of datasets in GeneDB. GeneDB is funded by the Wellcome Trust through its support of the Sanger Institute. GUS was developed by CBIL. GeneDB and the Centre for Tropical and Emerging Global Diseases, University of Georgia have made significant contributions to it as part of an ongoing collaborative effort with CBIL to further develop the schema.