The breadth of information of interest to ENA is ever expanding, both in terms of novel technologies for the resolution of nucleotide information and in the applications to which the information is put. We have adjusted our overall view of nucleotide sequence archiving and have abstracted somewhat from the underlying legacy infrastructure, such that sequencing information is classed as ‘reads’ (sequencing machine output, such as traces and flowgrams, together with base calls and quality scores), ‘assembly’ (information relating overlapping fragmented sequence reads to contigs, and covering higher order structures where contigs are arranged into representations of complete biological molecules, such as chromosomes) and ‘annotation’ (interpretations of biological function projected onto coordinate-defined regions of assembled sequence) (Figure 1). In all cases, the core information is provided solely by submitters and is updated only through submitter interaction. This is in sharp contrast to other databases, such as Ensembl, which provide a community view of the information provided. Associated with read, assembly and annotation information is information relating to the provenance and treatment of the biological samples used for sequencing. In this scheme, the INSDC component contributes information to both ENA-Assembly and ENA-Reads. Where possible, data in ENA-Annotation, ENA-Assembly and ENA-Reads are connected in a single integrated system, such that links can be made between data objects in each of the components (e.g. annotation on highly assembled sequence can be tracked back to the underlying contigs, and the capillary trace data that support a particular assembly can be retrieved).
Figure 1. ENA structure. The figure shows how nucleotide sequencing information is partitioned according to class: ENA-Reads treats raw sequencing information, ENA-Assembly treats information on how fragmented sequences have been assembled into higher order structures ...
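To make the linkage between the three classes concrete, the following is a minimal sketch of how an annotation could be tracked back through its contig to the supporting reads. All class names, field names and identifiers here (`Read`, `Contig`, `Annotation`, `supporting_reads`, `r1`, `c1`, `s1`) are hypothetical and chosen for illustration only; they do not describe the actual ENA schema or any ENA interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Read:
    """Hypothetical read-level object: base calls, qualities and sample provenance."""
    read_id: str
    bases: str
    qualities: List[int]
    sample_id: str  # link to the biological sample used for sequencing


@dataclass
class Contig:
    """Hypothetical assembly-level object: a contig and the reads that support it."""
    contig_id: str
    read_ids: List[str] = field(default_factory=list)


@dataclass
class Annotation:
    """Hypothetical annotation-level object: a feature on a coordinate-defined region."""
    feature: str
    contig_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def supporting_reads(annotation: Annotation,
                     contigs: Dict[str, Contig],
                     reads: Dict[str, Read]) -> List[Read]:
    """Follow annotation -> contig -> reads.

    This returns every read attached to the annotated contig; it does not
    filter reads by the annotated coordinates, since read placements are
    not modelled in this sketch.
    """
    contig = contigs[annotation.contig_id]
    return [reads[rid] for rid in contig.read_ids if rid in reads]


# Illustrative data only.
reads = {
    "r1": Read("r1", "ACGT", [30, 30, 30, 30], "s1"),
    "r2": Read("r2", "GTCA", [20, 25, 30, 35], "s1"),
}
contigs = {"c1": Contig("c1", ["r1", "r2"])}
gene = Annotation("CDS", "c1", 1, 6)

traced_ids = [r.read_id for r in supporting_reads(gene, contigs, reads)]
print(traced_ids)
```

In this sketch each class holds only identifiers for the objects it depends on, so an update to a read or contig record would not require rewriting the annotation that refers to it.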
As an archival repository, the primary information stored and presented is derived from the submitting parties; ownership, and hence editorial control, of primary content remains with the original submitting group. However, an archive of such size and diversity clearly requires sensible organization of data, both for management purposes and end-user utility (such as search and visualization) and for integration with the many other tools and data resources available at EBI and beyond. Such data organization and integration require active maintenance of mappings between ENA objects and objects in remote resources. Developments in these integration pipelines are discussed subsequently.
The ENA achieves comprehensive coverage of the world's nucleotide sequencing data through a number of active collaborations, most notably with DDBJ (3) and GenBank (4), through the INSDC (The International Nucleotide Sequence Database Collaboration, http://www.insdc.org), and through trace collaborations with the Wellcome Trust Sanger Institute and the trace archive at NCBI (5). As part of our drive to improve the utility of archived data, we are active in the development of a number of formats and standards, including MIGS (6), the CBoL BARCODE data standard (http://barcoding.si.edu/PDF/DWG_data_standards-Final.pdf) and MINSEQE (http://www.mged.org/minseqe/MINSEQE.pdf).
ENA provides comprehensive submission tools and services, permanent archiving of content and a multitude of data access resources. Points of entry into ENA services are detailed in the table that follows.
Points of entry to the ENA: submissions, retrieval and support