The data in GenBank and the collaborating databases, EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.
Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/Genbank/), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication. GenBank staff can usually assign an accession number to a sequence submission within two working days of receipt, and do so at a rate of almost 1600/day. The accession number serves as confirmation that the sequence has been submitted and provides a means for readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database.
Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitter is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at update/at/ncbi.nlm.nih.gov.
NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program tbl2asn, described at www.ncbi.nlm.nih.gov/Sequin/table.html.
Submission using BankIt
About a third of author submissions are received through an NCBI web-based data submission tool named BankIt (www.ncbi.nlm.nih.gov/BankIt). Using BankIt, authors enter sequence information directly into a form and add biological annotation such as coding regions or mRNA features. Free-form text boxes, list boxes and pull-down menus allow the submitter to describe the sequence further without having to learn formatting rules or restricted vocabularies. Before creating a draft record in the GenBank flat file format for the submitter to review, BankIt validates the submissions by flagging many common errors and checking for vector contamination using a variant of BLAST called Vecscreen. BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is being submitted (7). Submitters can also use BankIt to update their existing GenBank records. In 2009, NCBI released a new version of BankIt (www.ncbi.nlm.nih.gov/WebSub/?tool=genbank) that offers several improvements: a depositor’s contact information is stored and easily reused in future submissions; sets of sequences can be uploaded as one submission; feature table data can be uploaded from a file; and a submitter can leave a partially finished submission and return later to complete it.
Submission using Sequin and tbl2asn
NCBI also offers a standalone multiplatform submission program called Sequin (www.ncbi.nlm.nih.gov/projects/Sequin/) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences, such as a single cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments for which BankIt and other web-based submission tools are not well suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations from simple tables. The most recent version, Sequin 9.50, was released in July 2009 and is available for Macintosh, PC and UNIX computers via anonymous FTP at ftp.ncbi.nih.gov/sequin. Once a submission is completed, submitters can e-mail the Sequin file to gb-sub/at/ncbi.nlm.nih.gov. Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’ (described above) to convert a table of annotations generated from an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.
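As a rough illustration of the kind of annotation table an annotation pipeline might emit for tbl2asn, the sketch below writes features in a tab-delimited layout, assuming NCBI's five-column feature table format (a `>Feature` header line, then `start`, `stop` and feature key, with qualifiers on lines indented by three tabs). The helper `format_feature_table`, the sequence identifier `Seq1`, and all coordinates and names are hypothetical and chosen only for the example.

```python
def format_feature_table(seq_id, features):
    """Render features as a tab-delimited feature table (assumed layout).

    `features` is a list of (start, stop, feature_key, qualifiers) tuples,
    where `qualifiers` is a list of (qualifier_key, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop and feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for qual_key, qual_value in qualifiers:
            # Columns 4-5: qualifier key and value, after three empty columns.
            lines.append(f"\t\t\t{qual_key}\t{qual_value}")
    return "\n".join(lines) + "\n"


# Invented annotations for a hypothetical sequence "Seq1".
table = format_feature_table(
    "Seq1",
    [
        (1, 1200, "gene", [("gene", "abcD")]),
        (1, 1200, "CDS", [("product", "hypothetical protein")]),
    ],
)
print(table)
```

Keeping the annotations as plain tuples until the last step makes it easy for a pipeline to filter or correct features before the table is written out.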
Submission of barcode sequences
The Consortium for the Barcode of Life (CBOL) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence, usually a 648-bp fragment of the gene for cytochrome oxidase subunit I. NCBI, in collaboration with CBOL (www.barcoding.si.edu/), has created an online tool (BarSTool) for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/WebSub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. The Nucleotide query ‘barcode[keyword]’ retrieves almost 21 000 barcode sequences in GenBank, over 5000 of which were added in the last year.
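The same keyword query could in principle be issued programmatically. The sketch below only builds the request URL and makes no network call. It assumes NCBI's public E-utilities ESearch endpoint and its `db`, `term` and `retmax` parameters, none of which are described in the text above, and the function name `build_barcode_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not named in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_barcode_query_url(term="barcode[keyword]", retmax=0):
    """Return an ESearch URL for a Nucleotide query; nothing is fetched here."""
    params = {
        "db": "nuccore",   # E-utilities name for the Entrez Nucleotide database
        "term": term,      # the query string as typed into the search box
        "retmax": retmax,  # 0 requests only the hit count, no identifier list
    }
    # urlencode percent-escapes the square brackets in the field tag.
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_barcode_query_url()
print(url)
```

The number of hits returned by such a request would change over time, so the count quoted above should be read as a snapshot.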
Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an accession number that is shared across the three collaborating databases (GenBank, DDBJ and EMBL). The accession number appears on the ACCESSION line of a GenBank record and remains constant over the lifetime of the record, even when there is a change to the sequence or annotation. Changes to the sequence data themselves are tracked by an integer extension of the accession number, and this Accession.version identifier appears on the VERSION line of the GenBank flat file. The initial version of a sequence has the extension ‘.1’. In addition, each version of the DNA sequence is assigned a unique NCBI identifier called a ‘GI’ number, which appears on the VERSION line following the Accession.version:
VERSION AF000001.1 GI: 987654321
When a change is made to a sequence in a GenBank record, a new GI number is issued to the updated sequence and the version extension of the Accession.version identifier is incremented. The accession number for the record as a whole remains unchanged, and will always retrieve the most recent version of the record; the older versions remain available under the old Accession.version identifiers and their original GI numbers.
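The relationship between the accession number, the version extension and the GI number can be made concrete with a small parser for the VERSION line. This is a minimal sketch, not an official parser: the `VersionLine` type and `parse_version_line` function are hypothetical names, and the regular expression assumes the simple accession shape used in the example (uppercase letters, optionally with an underscore, followed by digits).

```python
import re
from typing import NamedTuple


class VersionLine(NamedTuple):
    accession: str  # constant for the lifetime of the record
    version: int    # incremented whenever the sequence changes
    gi: int         # new GI number issued for each sequence version


# Whitespace after "GI:" is optional, so "GI:987" and "GI: 987" both match.
_VERSION_RE = re.compile(r"^VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:\s*(\d+)$")


def parse_version_line(line: str) -> VersionLine:
    """Split a VERSION line into accession, version and GI number."""
    match = _VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return VersionLine(accession, int(version), int(gi))


first = parse_version_line("VERSION AF000001.1 GI: 987654321")
# A later version of the same record (the GI value here is invented).
second = parse_version_line("VERSION AF000001.2 GI: 987654399")

print(first)
print(first.accession == second.accession, second.version > first.version)
```

Comparing the two parsed lines shows the behavior described above: the accession is identical, while the version and GI number differ between sequence versions.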
A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for CDS features in the FEATURES portion of a GenBank entry, e.g. /protein_id="AAA00001.1". Protein sequence translations also receive their own unique GI number, which appears as a second qualifier on the CDS feature: /db_xref="GI: 1233445".
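The protein identifiers can be pulled out of a CDS feature in the same spirit. The sketch below assumes that qualifier values are enclosed in straight double quotes, as they are in GenBank flat files, and that each qualifier fits on one line. The function name `parse_cds_ids` and the feature location `1..300` are invented for the example.

```python
import re

# Matches qualifiers of the form /key="value" (single-line values only).
_QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')


def parse_cds_ids(feature_text):
    """Return ((protein_accession, version), protein_gi) for one CDS feature."""
    protein_id = None
    protein_gi = None
    for key, value in _QUALIFIER_RE.findall(feature_text):
        if key == "protein_id":
            accession, _, version = value.partition(".")
            protein_id = (accession, int(version) if version else None)
        elif key == "db_xref" and value.startswith("GI:"):
            # Tolerate optional whitespace after the "GI:" prefix.
            protein_gi = int(value[3:].strip())
    return protein_id, protein_gi


cds_feature = '''     CDS             1..300
                     /protein_id="AAA00001.1"
                     /db_xref="GI: 1233445"'''

print(parse_cds_ids(cds_feature))
```

A CDS feature may carry other `db_xref` qualifiers as well, which is why only values beginning with `GI:` are treated as the protein GI number here.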