Important changes to the flatfile format
Since release 87 (JUN-2006) the format of the EMBL flat file has undergone a change: the ID line now has a different structure (see below) and the SV line has been removed.
The changes to the ID line structure were as follows:
All tokens are separated by semicolons; the entry name is no longer displayed (the primary accession number appears in its place); the sequence version is indicated in the ID line; the topology is a distinct token and is indicated for both circular and linear molecules; and both the data class and the taxonomic division are displayed.
Below is an example of the new ID line:
The tokens represent:
- Primary accession number
- ‘SV’ + sequence version number
- Topology: ‘circular’ or ‘linear’
- Molecule type
- Data class: ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS, STD (‘normal’ entries have ‘STD’ for ‘standard’)
- Taxonomic division: HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL, PHG
- Sequence length + ‘BP.’
An explanation of the data class and taxonomic division, each represented in the ID line by a three-letter abbreviation, is available in the release notes.
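The token layout described above lends itself to simple parsing. The following is a minimal sketch, assuming the line starts with the two-letter line code ‘ID’ and contains exactly seven semicolon-separated tokens in the order listed; the names `IdLine` and `parse_id_line` and the sample line are illustrative assumptions, not part of the database distribution.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    taxonomic_division: str
    length_bp: int


def parse_id_line(line: str) -> IdLine:
    """Parse a new-style (release 87 onwards) ID line into its seven tokens."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the two-letter line code, then split on the semicolon separators.
    tokens = [token.strip() for token in line[2:].split(";")]
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, found {len(tokens)}")
    accession, sv, topology, mol_type, data_class, division, length = tokens
    if not sv.startswith("SV "):
        raise ValueError(f"unexpected sequence version token: {sv!r}")
    if topology not in ("circular", "linear"):
        raise ValueError(f"unexpected topology: {topology!r}")
    # The last token has the form '<number> BP.'
    length_parts = length.rstrip(".").split()
    if len(length_parts) != 2 or length_parts[1] != "BP":
        raise ValueError(f"unexpected length token: {length!r}")
    return IdLine(accession, int(sv[3:]), topology, mol_type,
                  data_class, division, int(length_parts[0]))


# Illustrative line only (an assumption, not quoted from this article).
sample = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
parsed = parse_id_line(sample)
print(parsed.accession, parsed.sequence_version, parsed.length_bp)
```

Because the molecule type may itself contain spaces, the sketch splits on semicolons rather than on whitespace.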
The entry name is no longer displayed in the ID line. Since EMBL release 3 (December 1983), the stable identifier for an entry has been the primary accession number.
A mapping file (deprecated entry name to accession number) was provided via the ftp server for those entries where the entry name did not coincide with the accession number at the point of change.
Two other changes are linked to the ID line change, both relating to the way the data are represented on the ftp server: release data and the cumulative file (the file containing all data created or updated since the last release) are now split into smaller files according to data class and taxonomic division. Full details of how the data are split are available in the ftp directories and in the release notes.
In the past year, INSDC-specific XML was developed further; in spring 2006, the decision was taken to stabilize the production version of the DTD in order to facilitate external developments based on it. The current production version of the XML is INSDSeq v1.4 and can be obtained from http://www.insdc.org/documents.html
Development of the EMBL-specific EMBLXML has continued and the format has been extended to the EMBLCDS dataset. CDS are now distributed via the ftp server in XML format in addition to the flatfile distribution. To further support the external use of the INSDC and EMBL XML formats, a web-based tool for instantaneous conversion between each XML format and the flatfile format has been created.
The EMBLCDS dataset was created in response to user requests for whole database dumps of coding sequence. EMBLCDS is now offered as a dataset updated daily, available by anonymous FTP, via SRS and via sequence similarity searches. There are currently 5.4 million EMBLCDS entries and 4.8 million items in the non-redundant EMBLCDSnr. To produce the non-redundant dataset, sequence checksums are used to collapse sequences with the same checksum into a single record.
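The checksum-based collapsing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which checksum algorithm is used, so SHA-256 from the Python standard library stands in for it, and the case and whitespace normalization, the function names and the sample identifiers are all hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sequence_checksum(sequence: str) -> str:
    """Checksum of a sequence (SHA-256 is a stand-in, not the production algorithm)."""
    # Assumption: ignore whitespace and case so formatting differences do not matter.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def collapse_by_checksum(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (identifier, sequence) pairs so that each checksum maps to one record."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for identifier, sequence in entries:
        groups[sequence_checksum(sequence)].append(identifier)
    return dict(groups)


# Hypothetical identifiers and sequences, for illustration only.
entries = [
    ("CDS_A", "atgaaa"),
    ("CDS_B", "ATGAAA"),
    ("CDS_C", "ATGCCC"),
]
nonredundant = collapse_by_checksum(entries)
print(len(entries), "entries collapse to", len(nonredundant), "records")
```

Each key in the result corresponds to one non-redundant record, and the list of identifiers behind it plays the role of the grouping index.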
Over the past year, several ways of grouping entries within the EMBLCDS dataset, apart from the grouping by checksum, were introduced: groups by gene name, by species and by shared exons. Grouping indices are available from the ftp server and are used in SRS views to link related records together.
As mentioned earlier in the ‘XML development’ section, EMBLXML has been extended to cover data from the EMBLCDS dataset.
Access to the data by map
In 2005, the International Nucleotide Sequence Database Collaboration introduced the lat_lon (latitude-longitude) qualifier. The qualifier allows submitters to specify precisely where the sequenced specimen was collected. The data collected so far can now be seen plotted on the world map at http://www3.ebi.ac.uk/Services/EMBLWorld/EMBLWorld.pl
The map offers three levels of zoom to allow viewing at greater magnification. Using the same geographical information, SRS views of EMBL entries link the data to Google Maps.
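To plot a collection site on a map, the qualifier value has to be converted to signed coordinates. The sketch below assumes the value is written as decimal degrees followed by hemisphere letters, for example ‘47.94 N 28.12 W’; this format is not specified in the text above, and the function name `parse_lat_lon` is hypothetical.

```python
import re
from typing import Tuple

# Assumed value format: '<degrees> N|S <degrees> E|W', in decimal degrees.
_LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")


def parse_lat_lon(value: str) -> Tuple[float, float]:
    """Convert a lat_lon value to (latitude, longitude) with south and west negative."""
    match = _LAT_LON.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized lat_lon value: {value!r}")
    latitude = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    longitude = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        raise ValueError(f"coordinates out of range: {value!r}")
    return latitude, longitude


print(parse_lat_lon("47.94 N 28.12 W"))
```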
The EMBL Nucleotide Sequence Database continued to extend the number and diversity of its cross-references to other databases. The number of cross-referenced databases was 27 in the September 2006 release and the number of individual cross-references was over 62 million.
Cross-referenced databases include UniProt (9), InterPro (10), GOA (11) and a few other major databases, along with more specific databases. The cross-referenced database GeneDB (http://www.genedb.org/), for example, holds the latest sequence data and annotation for organisms sequenced by the PSU (Pathogen Sequencing Unit) at The Wellcome Trust Sanger Institute.
‘Intradatabase’ cross-references were introduced in December 2005 and are internal to the EMBL database. They include EMBL-TPA, EMBL-ANN, EMBL-CON, EMBL-ALIGN and EMBL-JOIN and show some of the relationships between entries in the database that are otherwise difficult for users to infer; for example, the EMBL-TPA cross-reference:
will appear in a standard entry that serves as a primary source for the TPA entry BN000249. An explanation of each type of intradatabase cross-reference is given in the EMBL database release notes.
Further development of the TPA dataset
TPA records are submitted to the International Nucleotide Sequence Databases as part of the process of publishing biological studies that include the annotation of existing nucleotide sequences in the primary sequence database. Over the past year, the TPA dataset was divided into two tiers, TPA:experimental and TPA:inferential, to distinguish between annotation supported by wet laboratory experimental evidence and inferred annotation, where the source molecule or its products have not been the subject of direct experimentation (12).
Enhanced evidence system
To enable users to see the evidence for a particular annotation and make an informed judgment about its validity, the evidence tagging system was improved over the past year. The old qualifier ‘evidence’ was replaced by two new qualifiers, ‘experiment’ and ‘inference’. The value of ‘experiment’ is free text naming the experimental techniques used; ‘inference’ is a highly structured qualifier that details how the annotation was inferred. The structure of the qualifier is
TYPE[ (same species)][:EVIDENCE_BASIS]
where TYPE is one of the following:
- ‘non-experimental evidence, no additional details recorded’
- ‘similar to sequence’
- ‘similar to AA sequence’
- ‘similar to DNA sequence’
- ‘similar to RNA sequence’
- ‘similar to RNA sequence, mRNA’
- ‘similar to RNA sequence, EST’
- ‘similar to RNA sequence, other RNA’
- ‘nucleotide motif’
- ‘protein motif’
- ‘ab initio prediction’
The optional text ‘(same species)’ can be included when the inference comes from the same species as the entry.
The optional ‘EVIDENCE_BASIS’ is either a reference to a database entry (including accession and version) or an algorithm (including version), e.g. ‘INSD:AACN010222672.1’, ‘InterPro:IPR001900’, ‘ProDom:PD000600’, ‘Genscan:2.0’, etc.
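Because the ‘inference’ value is highly structured, it can be split mechanically into its parts. The following is a minimal sketch, assuming that no TYPE contains a colon, so that the first colon marks the start of EVIDENCE_BASIS; the names `Inference` and `parse_inference` are hypothetical, and the example values are composed only to illustrate the structure.

```python
from typing import NamedTuple, Optional

# The TYPE values listed above.
INFERENCE_TYPES = frozenset({
    "non-experimental evidence, no additional details recorded",
    "similar to sequence",
    "similar to AA sequence",
    "similar to DNA sequence",
    "similar to RNA sequence",
    "similar to RNA sequence, mRNA",
    "similar to RNA sequence, EST",
    "similar to RNA sequence, other RNA",
    "nucleotide motif",
    "protein motif",
    "ab initio prediction",
})

SAME_SPECIES = " (same species)"


class Inference(NamedTuple):
    type: str
    same_species: bool
    evidence_basis: Optional[str]


def parse_inference(value: str) -> Inference:
    """Split TYPE[ (same species)][:EVIDENCE_BASIS] into its three parts."""
    # The evidence basis may itself contain colons (e.g. 'Genscan:2.0'),
    # so split only on the first colon.
    head, separator, basis = value.partition(":")
    same_species = head.endswith(SAME_SPECIES)
    if same_species:
        head = head[: -len(SAME_SPECIES)]
    if head not in INFERENCE_TYPES:
        raise ValueError(f"unknown inference TYPE: {head!r}")
    return Inference(head, same_species, basis if separator else None)


print(parse_inference("ab initio prediction:Genscan:2.0"))
print(parse_inference("similar to RNA sequence, EST (same species):INSD:AACN010222672.1"))
```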
The new evidence tagging system described above has been available since December 2005 and, at the time of writing, has been applied in 1662 entries, with over 145 000 instances of the new qualifiers containing meaningful values (i.e. values other than ‘[non-]experimental evidence, no additional details recorded’).