GenBank assigns sequence records to various divisions based either on the source taxonomy or the sequencing strategy used to obtain the data. There are 12 ‘taxonomic’ divisions that correspond roughly to the source organisms of the sequence data (BCT, ENV, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT) and 8 ‘functional’ divisions (EST, GSS, HTC, HTG, PAT, STS, TSA, WGS) that collect sequences generated by a particular method. The size and growth of these divisions, and of GenBank as a whole, are shown in .
Growth of GenBank divisions (nucleotide base pairs)
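The split between taxonomic and functional divisions described above can be written as a simple lookup. This is a minimal sketch: the function name `division_kind` is hypothetical, and the two sets only restate the division codes listed in the text.

```python
# Division codes exactly as listed in the text above.
TAXONOMIC_DIVISIONS = frozenset(
    {"BCT", "ENV", "INV", "MAM", "PHG", "PLN", "PRI", "ROD", "SYN", "UNA", "VRL", "VRT"}
)
FUNCTIONAL_DIVISIONS = frozenset(
    {"EST", "GSS", "HTC", "HTG", "PAT", "STS", "TSA", "WGS"}
)


def division_kind(code: str) -> str:
    """Return 'taxonomic' or 'functional' for a GenBank division code.

    Hypothetical helper for illustration; not part of any NCBI tool.
    """
    normalized = code.strip().upper()
    if normalized in TAXONOMIC_DIVISIONS:
        return "taxonomic"
    if normalized in FUNCTIONAL_DIVISIONS:
        return "functional"
    raise ValueError(f"unknown GenBank division code: {code!r}")
```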
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy (www.ncbi.nlm.nih.gov/taxonomy/) developed by NCBI in collaboration with EMBL-Bank and DDBJ and with the valuable assistance of external advisers and curators (5). Almost 260 000 formally described species are represented in GenBank, and the top species in the non-WGS GenBank divisions are listed in .
Top organisms in GenBank (Release 191)
Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an accession number that is shared across the three collaborating databases (GenBank, DDBJ, EMBL-Bank). The accession number appears on the ACCESSION line of a GenBank record and remains constant over the lifetime of the record, even when the sequence or annotation changes. Changes to the sequence data are tracked by an integer extension of the accession number, and this Accession.version identifier appears on the VERSION line of the GenBank flat file. The initial version of a sequence has the extension ‘.1’. Changes that do not affect the sequence data, such as revised annotations or added publications, do not result in a new version number. Each version of the DNA sequence is also assigned a unique NCBI identifier called a GI number, which appears on the VERSION line following the Accession.version:
VERSION AF000001.5 GI:7274584
Each GI number corresponds to a unique Accession.version identifier. When a change is made to a sequence in a GenBank record, a new GI number is issued to the updated sequence and the version extension of the Accession.version identifier is incremented. The accession number for the record as a whole remains unchanged, and will always retrieve the most recent version of the record; the older versions remain available under the old Accession.version identifiers and their original GI numbers. The Revision History report, available from the ‘Display Settings’ menu on the sequence record view, summarizes the various updates for that GenBank record, both those that resulted in a new version (updates to sequence data) and those that did not (updates to non-sequence data).
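The relationship between accession, version and GI number can be made concrete by parsing a VERSION line. The sketch below is illustrative only: `SeqVersion`, `parse_version_line` and `is_sequence_update` are hypothetical names, and the regular expression assumes a line laid out like the example above, with optional whitespace after `GI:`.

```python
import re
from typing import NamedTuple


class SeqVersion(NamedTuple):
    accession: str  # constant over the lifetime of the record
    version: int    # incremented when the sequence data change
    gi: int         # a new GI number is issued for each sequence version


# Assumed layout: "VERSION <accession>.<version> GI:<number>".
_VERSION_LINE = re.compile(r"^VERSION\s+(\w+)\.(\d+)\s+GI:\s*(\d+)$")


def parse_version_line(line: str) -> SeqVersion:
    """Split a VERSION line into accession, version and GI number."""
    match = _VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line with a GI number: {line!r}")
    accession, version, gi = match.groups()
    return SeqVersion(accession, int(version), int(gi))


def is_sequence_update(old: SeqVersion, new: SeqVersion) -> bool:
    """True if `new` is a later sequence version of the same record as `old`."""
    return (
        old.accession == new.accession
        and new.version > old.version
        and new.gi != old.gi
    )


current = parse_version_line("VERSION     AF000001.5  GI:7274584")
```

Comparing two parsed lines with `is_sequence_update` mirrors the rule in the text: the accession stays the same, while both the version extension and the GI number change.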
A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for coding sequence (CDS) features in the FEATURES portion of a GenBank entry, e.g. /protein_id="AAF14809.1". Protein sequence translations also receive their own unique GI number, which appears as a second qualifier on the CDS feature:
/db_xref="GI:6513858"
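As a rough illustration, the two protein identifiers can be pulled out of the qualifier text of a CDS feature. This is a sketch under assumptions: `cds_protein_ids` is a hypothetical helper, the qualifiers are assumed to have already been extracted as plain text, and the pattern accepts either single or double quotes and optional spaces around `=`.

```python
import re
from typing import Optional, Tuple

# Matches qualifiers of the form /name="value" (single quotes also accepted).
_QUALIFIER = re.compile(r"/(\w+)\s*=\s*[\"']([^\"']*)[\"']")


def cds_protein_ids(feature_text: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (protein Accession.version, protein GI) from CDS qualifier text.

    Either element is None if the corresponding qualifier is absent.
    """
    protein_id: Optional[str] = None
    gi: Optional[int] = None
    for name, value in _QUALIFIER.findall(feature_text):
        if name == "protein_id":
            protein_id = value
        elif name == "db_xref" and value.startswith("GI:"):
            gi = int(value[len("GI:"):])
    return protein_id, gi


qualifiers = '/protein_id="AAF14809.1"\n/db_xref="GI:6513858"'
ids = cds_protein_ids(qualifiers)
```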
Citing GenBank records
Besides being the primary identifier of a GenBank sequence record, the GenBank accession is also the most efficient and reliable way to cite a sequence record in publications, and we encourage submitters and other authors to cite GenBank data using accessions. However, as discussed above, a search with a GenBank accession number retrieves the most recent version of the sequence data for a record, so the sequence returned will change over time if the record is updated. The sequence data retrieved today by an accession may therefore differ from that discussed or analysed in an article published several years ago. We therefore recommend that authors include the version suffix when citing a GenBank accession (e.g. AF000001.5), particularly where the sequence coordinates are critical to the work being described.
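An author or editor could check whether a cited identifier pins a specific sequence version along the following lines. This is only a sketch: `split_identifier` and `pins_sequence` are hypothetical names, and the pattern (uppercase letters followed by digits, with an optional `.version` suffix) is a simplifying assumption rather than a full definition of valid accession formats.

```python
import re
from typing import Optional, Tuple

# Simplified assumption: letters then digits, optionally followed by ".<version>".
_IDENTIFIER = re.compile(r"^([A-Z]+\d+)(?:\.(\d+))?$")


def split_identifier(identifier: str) -> Tuple[str, Optional[int]]:
    """Split a cited identifier into (accession, version or None)."""
    match = _IDENTIFIER.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


def pins_sequence(identifier: str) -> bool:
    """True if the citation includes a version suffix and so fixes the sequence."""
    return split_identifier(identifier)[1] is not None
```

Under these assumptions, `pins_sequence("AF000001.5")` is true, whereas `pins_sequence("AF000001")` is false because a bare accession always retrieves the most recent version.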