The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl) at the EMBL European Bioinformatics Institute, UK, offers a large and freely accessible collection of nucleotide sequences and accompanying annotation. The database is maintained in collaboration with DDBJ and GenBank. Data are exchanged between the collaborating databases on a daily basis to achieve optimal synchrony. Webin is the preferred tool for individual submissions of nucleotide sequences, including Third Party Annotation, alignments and bulk data. Automated procedures are provided for submissions from large-scale sequencing projects and data from the European Patent Office. In 2006, the volume of data continued to grow exponentially. Access to the data is provided via SRS, ftp and a variety of other methods. Extensive external and internal cross-references enable users to search for related information across other databases and within the database. All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk/. Changes over the past year include changes to the file format, further development of the EMBLCDS dataset and developments to the XML format.
The EMBL Nucleotide Sequence Database is the European node of the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org/) between DDBJ (1), EMBL and GenBank (2). The collaborative aim is to collect and present nucleotide sequence and annotation as comprehensively as possible.
The EMBL Nucleotide Sequence Database (EMBL) is maintained at the European Bioinformatics Institute, which hosts several other core biological databases (3).
The main goal of the EMBL Nucleotide Sequence Database is to accept, process and make freely available sequence data from individual researchers, research groups and the European Patent Office (EPO). Collected nucleotide sequences and accompanying annotation are made available via the EBI Sequence Retrieval System (SRS), ftp, web services and similarity search tools.
EMBL database releases, with accompanying release notes, are produced quarterly.
The database is presented as individual entries, each carrying sequence or information on sequence construction, submission information (submission and update dates, version numbers and submitter details), literature citations and annotation in the form of a feature table. Full details of database flatfile format are available in the user manual. Details of feature table format are available in the INSDC Feature Table Definition. Data are also presented in XML formats via the web tools, dbfetch and ftp.
Each entry in the database belongs to one of several entry types, which differ in either data format or the way the data are handled by the database. Entry types include standard (STD), constructed (CON), third party annotation (TPA), whole genome shotgun (WGS), annotated constructed (ANN) and mass genome annotation library (MGA). New entry types are created as new types of data arrive at the database.
Over the past year, the size of the EMBL Nucleotide Sequence Database has increased from 58.7 million entries in Release 84, September 2005 to 80.5 million entries in Release 88, September 2006, of which 18 million entries are WGS data. The WGS entries now account for >50% of the nucleotide content of the database (80.3 Gbp out of 146.5 Gbp in September 2006). There are now over 260 000 organisms represented in the database.
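The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Figures quoted in the text (entries in millions, nucleotides in Gbp).
entries_sep_2005 = 58.7   # Release 84, September 2005
entries_sep_2006 = 80.5   # Release 88, September 2006
wgs_gbp = 80.3            # WGS nucleotides, September 2006
total_gbp = 146.5         # all nucleotides, September 2006

# Relative growth in the number of entries over the year.
entry_growth = (entries_sep_2006 - entries_sep_2005) / entries_sep_2005

# Fraction of the nucleotide content contributed by WGS entries.
wgs_share = wgs_gbp / total_gbp

print(f"Entry growth over the year: {entry_growth:.1%}")  # 37.1%
print(f"WGS share of nucleotides:   {wgs_share:.1%}")     # 54.8%
```

The entry count therefore grew by roughly 37% in one year, and WGS data make up about 55% of the nucleotide content, consistent with the ">50%" stated above.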
During the last year, an important EMBL flatfile format change was completed and there were further developments to XML formats, XML distribution and tools and the TPA dataset.
EMBL database submission procedures are briefly described below. Full details of procedures are available at http://www.ebi.ac.uk/embl/Submission/
Webin is the preferred submission system for nucleotide sequence and biological annotation. Webin has been designed to allow rapid submission of single, multiple or very large numbers of sequences (bulk data) and is available at http://www.ebi.ac.uk/embl/Submission/webin.html. Bulk data submission in the fasta format is possible via Webin, provided that the fasta format is sufficient to describe all differences between submitted entries in terms of sequence and annotation fields.
TPA submissions are accepted via Webin; a modification of Webin is also available that is able to accept alignment submissions for inclusion into the EMBL-Align dataset (4). This service is available at http://www.ebi.ac.uk/embl/Submission/align_top.html.
Database entries produced at sequencing sites can be deposited and updated directly by the submitters using FTP or email. Groups producing and updating large volumes of genome sequence data, including WGS, over an extended period of time are advised to contact the database at datasubs@ebi.ac.uk.
Sequence data extracted from biotechnology patent application submissions to the EPO are received, processed and made available weekly in the EMBL Nucleotide Sequence Database. A stable link between the patent document number, the sequence number within the document and the accession number is maintained. The EMBL Nucleotide Sequence Database processes both nucleotide and protein sequences from the EPO, but the distribution methods, collaborative data exchange mechanisms and exchange frequency for protein sequences differ from those of nucleotide sequences.
New and updated database records are exchanged on a daily basis between EMBL, DDBJ and GenBank; WGS datasets are exchanged as soon as they become available or have been updated. In addition to data exchange, lists of accession numbers are exchanged weekly to achieve maximum synchrony in data availability at all three sites.
The main access method to EMBL Nucleotide Sequence Database data is SRS (5,6); the FTP server, homology search tools, the Genomes web server (for completely sequenced genomes) and sequence retrieval by accession number (Dbfetch, Wsdbfetch and netserv) are also available (7). Access to all versions, current and historical, of EMBL Nucleotide Sequence Database entries, including CON, TPA and WGS data, is available via the Sequence Version Archive, SVA (8).
In addition to these facilities that offer a range of ways to search and download data, there are several sites that mirror EMBL Nucleotide Sequence Database data, which provide distributed ftp access.
Since release 87 (JUN-2006) the format of the EMBL flat file has undergone a change: the ID line now has a different structure (see below) and the SV line has been removed.
The changes to the ID line structure were as follows:
All tokens are separated by a semicolon; the entry name is not displayed (the primary accession number appears in its place); the sequence version is indicated in the ID line; the topology is a distinct token and is indicated for both circular and linear molecules; and both the data class and the taxonomic division are displayed.
Below is an example of the new ID line:
The tokens represent:
Primary accession number;
‘SV’ + sequence version number;
Topology: ‘circular’ or ‘linear’;
Molecule type;
Data class: ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS or STD (‘normal’ entries have ‘STD’ for ‘standard’);
Taxonomic division: HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL or PHG;
Sequence length + ‘BP.’
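As an illustration of the token layout described above, the following minimal sketch splits an ID line into its seven tokens. It is not the database's own parser: the function name `parse_id_line`, the `IdLine` record and the sample line (including the accession number `AB000001`) are hypothetical, and the sketch assumes the line begins with the `ID` line code followed by whitespace.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    taxonomic_division: str
    length_bp: int


def parse_id_line(line: str) -> IdLine:
    """Split a new-style ID line into its seven semicolon-separated tokens."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the 'ID' line code, then split on the semicolon separators.
    tokens = [t.strip() for t in line[2:].split(";")]
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, found {len(tokens)}")
    accession, sv, topology, mol_type, data_class, division, length = tokens
    if not sv.startswith("SV"):
        raise ValueError("second token must be 'SV' + version number")
    if topology not in ("circular", "linear"):
        raise ValueError(f"unexpected topology: {topology!r}")
    length_parts = length.split()
    if len(length_parts) != 2 or length_parts[1] != "BP.":
        raise ValueError("last token must be sequence length + 'BP.'")
    return IdLine(accession, int(sv[2:].strip()), topology, mol_type,
                  data_class, division, int(length_parts[0]))


# Hypothetical sample line, made up for illustration only.
record = parse_id_line("ID   AB000001; SV 1; linear; mRNA; STD; PLN; 1859 BP.")
print(record.accession, record.sequence_version, record.length_bp)
```

Splitting on the semicolon rather than on whitespace keeps molecule types that contain spaces intact as a single token.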
An explanation of the data class and taxonomic division, each represented in the ID line by a three-letter abbreviation, is available in the release notes.
The entry name is no longer displayed in the ID line. Since EMBL release 3 (December 1983), the stable identifier for an entry has been the primary accession number.
A mapping file (deprecated entry name to accession number) was provided via the ftp server for those entries where the entry name did not coincide with the accession number at the point of change.
Two other changes are linked to the ID line change, both relating to the way the data are represented on the ftp server: the release data and the cumulative file (the file containing all the data created or updated since the last release) are now split into smaller files according to data class and taxonomic division. Full details on the way in which data are split on the ftp server are available in the ftp directories and in the release notes.
In the past year, INSDC-specific XML was developed further; in spring 2006, the decision was taken to stabilize the production version of the DTD in order to facilitate external developments based on it. The current production version of the XML is INSDSeq v1.4 and can be obtained from http://www.insdc.org/documents.html.
Development of the EMBL-specific EMBLXML has continued and has been extended to the EMBLCDS dataset. CDS are now distributed via the ftp server in the XML format in addition to the flatfile distribution. To further support the external use of the INSDC and EMBL XML formats, a web-based tool for instantaneous conversion between each of the XML formats and the flatfile format has been created.
The EMBLCDS dataset was created in response to user requests for whole database dumps of coding sequence. EMBLCDS is now offered as a dataset updated daily, available by anonymous FTP, via SRS and via sequence similarity searches. There are currently 5.4 million EMBLCDS entries and 4.8 million items in the non-redundant EMBLCDSnr. To produce the non-redundant dataset, sequence checksums are used to collapse sequences with the same checksum into a single record.
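The checksum-based collapsing described above can be sketched as follows. The text does not say which checksum algorithm is used, so SHA-256 from the Python standard library stands in here as an assumption; the case and whitespace normalisation, the function names and the sample identifiers are likewise illustrative only.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def sequence_checksum(sequence: str) -> str:
    """Checksum of a sequence (SHA-256 is a stand-in, not the real algorithm)."""
    # Assumption: ignore whitespace and case before computing the checksum.
    normalised = "".join(sequence.split()).upper()
    return hashlib.sha256(normalised.encode("ascii")).hexdigest()


def collapse_by_checksum(
    entries: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group entry identifiers whose sequences share the same checksum."""
    groups: Dict[str, List[str]] = {}
    for entry_id, sequence in entries:
        groups.setdefault(sequence_checksum(sequence), []).append(entry_id)
    return groups


# Hypothetical identifiers and sequences, for illustration only.
sample = [("CDS_A", "atgaaatga"), ("CDS_B", "ATGAAATGA"), ("CDS_C", "ATGCCCTGA")]
non_redundant = collapse_by_checksum(sample)
print(len(sample), "entries collapse into", len(non_redundant), "records")
```

Each key of the resulting dictionary corresponds to one non-redundant record, and the list under it names the entries that were collapsed into that record.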
Over the past year, several ways of grouping entries within the EMBLCDS dataset, apart from the grouping by checksum, were introduced: groups by gene name, by species and by shared exons. Grouping indices are available from the ftp server and are used in SRS views to link related records together.
As mentioned earlier in the ‘XML development’ section, EMBLXML has been extended to cover data from the EMBLCDS dataset.
In 2005, the International Nucleotide Sequence Database Collaboration introduced the lat_lon (latitude-longitude) qualifier. The qualifier allows submitters to specify precisely where the sequenced specimen was collected. The data collected so far can now be seen plotted on the world map at http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl (Figure 1).
The EMBL Nucleotide Sequence Database continued to extend the number and diversity of its cross-references to other databases. The number of cross-referenced databases was 27 in the September 2006 release and the number of individual cross-references was over 62 million.
Cross-referenced databases include UniProt (9), InterPro (10), GOA (11) and a few other major databases, along with more specific databases. The cross-referenced database GeneDB (http://www.genedb.org/), for example, holds the latest sequence data and annotation for organisms sequenced by the PSU (Pathogen Sequencing Unit) at The Wellcome Trust Sanger Institute.
‘Intradatabase’ cross-references were introduced in December 2005 and are internal to the EMBL database. They include EMBL-TPA, EMBL-ANN, EMBL-CON, EMBL-ALIGN and EMBL-JOIN and show some of the relationships between the entries in the database that are otherwise difficult for users to infer; for example, the EMBL-TPA cross-reference:
will appear in a standard entry that serves as a primary source for the TPA entry BN000249. An explanation of each type of intradatabase cross-reference is given in the EMBL database release notes.
TPA records are submitted to the International Nucleotide Sequence Databases as part of the process of publishing biological studies that include the annotation of existing nucleotide sequences in the primary sequence database. Over the past year, the TPA dataset was divided into two tiers, TPA:experimental and TPA:inferential, to distinguish between annotation supported by wet laboratory experimental evidence and inferred annotation, where the source molecule or its products have not been the subject of direct experimentation (12).
In order to enable users to see the evidence for a particular annotation and make an informed judgment about its validity, the evidence tagging system was improved over the year. In place of the old qualifier ‘evidence’, two new qualifiers, ‘experiment’ and ‘inference’, were introduced. The ‘experiment’ value is free text naming the experimental techniques used; ‘inference’ is a highly structured qualifier that details how the annotation was inferred. The structure of the qualifier is
TYPE[ (same species)][:EVIDENCE_BASIS]
where TYPE is one of the following:
The optional text ‘(same species)’ can be included when the inference comes from the same species as the entry.
The optional ‘EVIDENCE_BASIS’ is either a reference to a database entry (including accession and version) or an algorithm (including version), e.g. ‘INSD:AACN010222672.1’, ‘InterPro:IPR001900’, ‘ProDom:PD000600’, ‘Genscan:2.0’, etc.
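The structure TYPE[ (same species)][:EVIDENCE_BASIS] can be taken apart with a short routine. The sketch below is illustrative rather than an official parser: it assumes that TYPE itself never contains a colon, so that everything after the first colon is the evidence basis, and it does not validate TYPE against the permitted values. The names `Inference` and `parse_inference` and the TYPE strings used in the examples are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

SAME_SPECIES = " (same species)"


class Inference(NamedTuple):
    inference_type: str
    same_species: bool
    evidence_basis: Optional[str]


def parse_inference(value: str) -> Inference:
    """Split an 'inference' qualifier value into its three components."""
    # Assumption: TYPE contains no colon, so the first colon starts the basis.
    head, separator, basis = value.partition(":")
    same_species = head.endswith(SAME_SPECIES)
    if same_species:
        head = head[: -len(SAME_SPECIES)]
    head = head.strip()
    if not head:
        raise ValueError("inference value has no TYPE")
    return Inference(head, same_species, basis if separator else None)


# The TYPE strings below are illustrative; the evidence bases come from the text.
print(parse_inference("similar to DNA sequence (same species):INSD:AACN010222672.1"))
print(parse_inference("ab initio prediction:Genscan:2.0"))
```

Because an evidence basis such as ‘INSD:AACN010222672.1’ contains a colon of its own, the value is split only at the first colon rather than at every one.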
A complete list of all features and qualifiers is available at http://www.ebi.ac.uk/embl/WebFeat/index.html.
The new evidence tagging system described above has been available since December 2005 and has, at the time of writing, been applied in 1662 entries, with over 145 000 instances of the new qualifiers containing meaningful values (i.e. values other than “[non-] experimental evidence, no additional details recorded”).
Funding to pay the Open Access publication charges for this article was provided by EMBL.
Conflict of interest statement. None declared.