Sequence length limits
Currently, database records are limited in length to 350 000 bp. At the DDBJ/EMBL/GenBank collaborative meeting of May 2003, a decision was taken to remove the size restriction on database records in June 2004.
This development will allow the entire sequence derived from a naturally occurring biological unit to be stored as a single database entry, thus eliminating the need to split long sequences into segments and create CON entries to store the assembly information (19). Currently, ~3% of all base pairs in the database are stored in the constituent segment entries of CON entries.
Third Party Annotation (TPA) data set
Until recently, the collaborative databases have collected and distributed only primary nucleotide sequence and annotation data resulting from direct sequencing of such molecules as cDNAs, ESTs and genomic DNA. ‘Primary data’ is defined as annotated sequence that has been determined by submitters and their teams. Primary database entries remain in the ownership of the original submitter and the co-authors of the submission publication(s). The owners of database entries have privileges to implement updates to the data.
In response to demand from the research community, the collaborative databases have created the TPA data set. The types of data that make up the TPA data set include reannotations of existing entries, combinations of novel sequence with existing primary entries, and annotation of trace archive and WGS data.
TPA data are submitted using Webin. Submitters are required to provide DDBJ/EMBL/GenBank accession and version numbers and nucleotide locations for all primary entries to which their TPA entry relates. For TPA sequences composed from trace archive data, the identifier (e.g. TI123445566) and corresponding nucleotide locations must be provided.
TPA entries can be distinguished easily from their primary counterparts. The abbreviation ‘TPA:’ appears at the beginning of each description (DE) line and the keywords ‘Third Party Annotation’ and ‘TPA’ appear in the keyword (KW) line.
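The two markers described above lend themselves to a simple programmatic check. The following is a minimal sketch, assuming the usual flat-file layout in which a two-letter line code is followed by spaces and then the line content; the function names and the sample entries are hypothetical and are not part of any EMBL tool.

```python
def line_content(entry_text: str, line_code: str) -> str:
    """Join the content of every flat-file line that starts with line_code."""
    parts = []
    for line in entry_text.splitlines():
        # Require the code to be followed by a space (or nothing at all).
        if line[:2] == line_code and line[2:3] in ("", " "):
            parts.append(line[2:].strip())
    return " ".join(parts)


def is_tpa_entry(entry_text: str) -> bool:
    """True if the DE line starts with 'TPA:' and both TPA keywords are present."""
    description = line_content(entry_text, "DE")
    # Keywords are assumed to be separated by ';' and terminated by '.'.
    keywords = {
        kw.strip().rstrip(".").strip()
        for kw in line_content(entry_text, "KW").split(";")
    }
    return description.startswith("TPA:") and {"Third Party Annotation", "TPA"} <= keywords


# Hypothetical, heavily abbreviated entries used only for illustration.
TPA_EXAMPLE = "DE   TPA: hypothetical example mRNA\nKW   Third Party Annotation; TPA.\n"
PRIMARY_EXAMPLE = "DE   Hypothetical example mRNA\nKW   .\n"

print(is_tpa_entry(TPA_EXAMPLE), is_tpa_entry(PRIMARY_EXAMPLE))
```

Checking both the DE prefix and the KW keywords, rather than either one alone, mirrors the description in the text and avoids matching a primary entry that merely mentions TPA in its description.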
AH   TPA_SPAN    PRIMARY_IDENTIFIER   PRIMARY_SPAN   COMP
AS   1–251       BE529226.1           1–251
AS   68–450      BE524624.1           1–383
AS   394–1086    AJ420881.1           1–693
AS   826–1211    AV561543.1           1–386
The flat-file extract shown above (from BN000024) shows the two new line types that have been created for TPA entries. The Assembly Header (AH) line provides column headings for the assembly information. The Assembly (AS) lines provide information on the composition of the TPA sequence by listing base span(s) of the TPA sequence together with identifiers and base span(s) of contributing sequences.
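As an illustration of how the AS lines might be consumed by software, the sketch below parses the four lines of the extract into structured records. It is not an official parser: the `AssemblySpan` type and the function name are hypothetical, the regular expression accepts either a hyphen or an en dash between span coordinates, and the COMP column is treated as an optional free-text flag because the extract shows no value for it.

```python
import re
from typing import List, NamedTuple, Optional


class AssemblySpan(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str       # accession.version or trace identifier
    primary_start: int
    primary_end: int
    comp: Optional[str]   # content of the COMP column, if any (assumption)


# "AS", a TPA span, an identifier, a primary span and an optional COMP value.
AS_LINE = re.compile(
    r"^AS\s+(\d+)[-\u2013](\d+)\s+(\S+)\s+(\d+)[-\u2013](\d+)(?:\s+(\S+))?\s*$"
)


def parse_assembly_lines(text: str) -> List[AssemblySpan]:
    """Return one AssemblySpan per AS line; all other lines are ignored."""
    spans = []
    for line in text.splitlines():
        match = AS_LINE.match(line)
        if match is None:
            continue
        t_start, t_end, ident, p_start, p_end, comp = match.groups()
        spans.append(
            AssemblySpan(int(t_start), int(t_end), ident, int(p_start), int(p_end), comp)
        )
    return spans


EXTRACT = """\
AH TPA_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP
AS 1–251 BE529226.1 1–251
AS 68–450 BE524624.1 1–383
AS 394–1086 AJ420881.1 1–693
AS 826–1211 AV561543.1 1–386
"""

for span in parse_assembly_lines(EXTRACT):
    print(span.primary_id, span.tpa_start, span.tpa_end)
```

In this extract each TPA span has the same length as the corresponding primary span, which offers a simple consistency check on parsed records.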
To ensure sequence annotation of the highest quality, TPA entries that have not yet been discussed in peer-reviewed publications are held confidential and are not visible to database users. This is an important difference from our policy of data release for primary entries.
EMBL Sequence Version Archive (SVA)
The EMBL SVA (20) was created to provide access to all versions of EMBL Nucleotide Sequence Database entries, including CON, TPA and WGS data. There were 145 million entry versions in the archive by September 2003, and new versions are being added every day. Entries from all past EMBL Nucleotide Sequence Database releases, starting with the first release in 1982, have been loaded into the archive.
Each time an EMBL database entry is created or modified it is loaded into the archive, where it can be accessed and compared with other versions of the same entry. If an entry is updated, corrected or extended as a result of new findings from recent experiments, the entry version is incremented. Changes in the taxonomic lineage or in flat-file formatting are not reflected in the entry version. For this reason, the archive may contain several variants of an entry with the same entry version number.
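The consequence of this versioning rule is that an entry version number alone does not identify a single archived record. The sketch below makes the point with a purely hypothetical data model: the `ArchivedVariant` type, its fields and the sample history are invented for illustration and do not reflect the actual schema of the archive.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple


class ArchivedVariant(NamedTuple):
    loaded: date         # date the variant was loaded into the archive
    entry_version: int   # incremented only for substantive changes
    change: str          # free-text description of what changed


def variants_by_entry_version(
    variants: List[ArchivedVariant],
) -> Dict[int, List[ArchivedVariant]]:
    """Group archived variants by entry version, oldest first within each group."""
    grouped = defaultdict(list)
    for variant in sorted(variants, key=lambda v: v.loaded):
        grouped[variant.entry_version].append(variant)
    return dict(grouped)


# Hypothetical history: a lineage change creates a new archived variant
# without incrementing the entry version.
HISTORY = [
    ArchivedVariant(date(2003, 1, 10), 2, "sequence extended"),
    ArchivedVariant(date(2001, 3, 1), 1, "entry created"),
    ArchivedVariant(date(2002, 6, 15), 1, "taxonomic lineage changed"),
]

grouped = variants_by_entry_version(HISTORY)
for version, group in sorted(grouped.items()):
    print(version, [v.change for v in group])
```

A client that needs one specific archived record would therefore have to key on the load date as well as the entry version.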
Entries can be retrieved interactively using accession numbers, protein identifiers and sequence versions. The user chooses to view either the complete chronological history of an entry or the entry version that was current at a specified date. The resulting entry versions can be viewed, downloaded and compared. The interactive interface can also be reached by following hyperlinks from the EBI SRS query results page when working with EMBL Nucleotide Sequence Database and EMBL-Align entries. As an example of programmatic entry retrieval, the following URL returns the latest EMBL entry having the accession number AC067752: http://www.ebi.ac.uk/cgi-bin/dbfetch?db=SVA&id=AC067752&format=default
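The retrieval URL quoted above can be assembled programmatically. The following minimal sketch only builds the URL from the parameters that appear in the text (`db`, `id` and `format`); the function name is hypothetical, no request is sent, and whether the endpoint accepts other parameter values is not assumed here.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in the text.
DBFETCH_BASE = "http://www.ebi.ac.uk/cgi-bin/dbfetch"


def sva_fetch_url(accession: str, fmt: str = "default") -> str:
    """Build a dbfetch URL for an accession number in the SVA database."""
    query = urlencode({"db": "SVA", "id": accession, "format": fmt})
    return f"{DBFETCH_BASE}?{query}"


print(sva_fetch_url("AC067752"))
```

Using `urlencode` rather than string concatenation ensures that any reserved characters in an identifier are escaped correctly.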
XML format for data exchange
The EMBL Nucleotide Sequence Database has initiated efforts to produce an XML format for the distribution of entries. The development of this format will be carried out in collaboration with DDBJ and GenBank with the aim of developing a common representation for the distribution of data.