The sequences and biological annotations in GenBank, and the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS, or HTG sequences. Information is exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.
Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/Genbank/index.html), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication.
GenBank staff can usually assign an accession number to a sequence submission within two working days of receipt, and do so at a rate of almost 1600 per day. The accession number serves as confirmation that the sequence has been submitted and allows readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy, and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database. Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at update/at/ncbi.nlm.nih.gov.
NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program ‘tbl2asn’, described at www.ncbi.nlm.nih.gov/Sequin/table.html.
Submission using BankIt
About one-third of author submissions are received through NCBI's web-based data submission tool, BankIt (www.ncbi.nlm.nih.gov/BankIt). Using BankIt, authors enter sequence information directly into a form and add biological annotations such as coding regions or mRNA features. Free-form text boxes, list boxes, and pull-down menus allow the submitter to further describe the sequence without having to learn formatting rules or restricted vocabularies. BankIt validates submissions, flagging many common errors, and checks for vector contamination using a variant of BLAST called VecScreen, before creating a draft record in GenBank flat file format for the submitter to review. BankIt is the tool of choice for simple submissions, especially when only one or a small number of records is to be submitted (7). BankIt can also be used by submitters to update their existing GenBank records.
Submission using Sequin and tbl2asn
NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/Sequin/index.html) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples, and alignments for which BankIt and other web-based submission tools are not well suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as that of the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations via simple tables. Versions for Macintosh, PC and Unix computers are available via anonymous FTP at ftp.ncbi.nih.gov in the ‘sequin’ directory. Once a submission is completed, submitters can e-mail the Sequin file to gb-sub/at/ncbi.nlm.nih.gov.
Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’, referenced above under ‘Direct electronic submission’, to convert a table of annotations generated via an annotation pipeline into an ASN.1 record suitable for submission to GenBank.
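As a rough illustration of the kind of annotation table an annotation pipeline might emit for tbl2asn, the sketch below writes features in a tab-delimited, five-column layout (start, stop, and feature key on one line, followed by indented qualifier/value lines). The layout reflects our understanding of the Sequin table format and should be checked against the documentation at the URL given above; the sequence identifier, coordinates, gene name, and product are hypothetical, as are the helper names `Feature` and `format_feature_table`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """One annotated feature; all values used below are illustrative only."""
    start: int
    stop: int
    key: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)


def format_feature_table(seq_id: str, features: List[Feature]) -> str:
    """Render features as a tab-delimited table (assumed five-column layout)."""
    lines = [f">Feature {seq_id}"]
    for feat in features:
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{feat.start}\t{feat.stop}\t{feat.key}")
        for name, value in feat.qualifiers:
            # Columns 4-5: qualifier name and value, after three empty columns.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    table = format_feature_table(
        "lcl|contig1",  # hypothetical local sequence identifier
        [
            Feature(1, 1200, "gene", [("gene", "abcD")]),
            Feature(1, 1200, "CDS", [("product", "hypothetical protein")]),
        ],
    )
    print(table, end="")
```

Keeping the table generation separate from the pipeline's own data model means the same annotations could be re-exported if the expected table layout changes.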
Submission of barcode sequences
The Consortium for the Barcode of Life (CBOL) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence derived from a portion of the cytochrome oxidase subunit I gene. NCBI, in collaboration with CBOL (barcoding.si.edu/index_detail.htm), has created an online tool for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/BankIt/barcode/) that allows users to upload files containing a batch of sequences with associated source information. It is anticipated that this tool will be used for other types of bulk submissions in the near future.
Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier, the accession number, that is shared across the three collaborating databases (GenBank, DDBJ, EMBL) and remains constant over the lifetime of the record even when there is a change to the sequence or annotation. Each version of the DNA sequence within a GenBank record is also assigned a unique NCBI identifier, called a ‘gi’, that appears on the VERSION line of GenBank flatfile records following the accession number. A third identifier of the form ‘Accession.version’, also displayed on the VERSION line of flatfile records, contains the information present in both the gi and accession numbers. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the ACCESSION number of the GenBank record followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.
VERSION     AF000001.1  GI:987654321
When a change is made to a sequence given in a GenBank record, a new gi number is issued to the sequence and the version extension of the ‘Accession.version’ identifier is incremented. The accession number for the record as a whole remains unchanged and the older sequence remains available under the old ‘Accession.version’ identifier and gi.
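The relationship between the three identifiers can be made concrete with a small parser. The following is a minimal sketch, assuming a VERSION line of the form shown in the example above; the names `VersionLine`, `parse_version_line`, and `revise_sequence` are hypothetical helpers, not part of any NCBI toolkit, and the replacement gi value is invented for illustration.

```python
import re
from typing import NamedTuple


class VersionLine(NamedTuple):
    accession: str  # stable for the lifetime of the record
    version: int    # incremented when the sequence changes
    gi: int         # reissued when the sequence changes


# Tolerates optional whitespace after "GI:" as it appears in some renderings.
_VERSION_RE = re.compile(r"^VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:\s*(\d+)$")


def parse_version_line(line: str) -> VersionLine:
    """Split a flat file VERSION line into accession, version, and gi."""
    match = _VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised VERSION line: {line!r}")
    return VersionLine(match.group(1), int(match.group(2)), int(match.group(3)))


def revise_sequence(current: VersionLine, new_gi: int) -> VersionLine:
    """Model a sequence change: same accession, next version, new gi."""
    return VersionLine(current.accession, current.version + 1, new_gi)


if __name__ == "__main__":
    first = parse_version_line("VERSION AF000001.1 GI: 987654321")
    second = revise_sequence(first, new_gi=987654999)  # hypothetical new gi
    print(f"{first.accession}.{first.version} -> {second.accession}.{second.version}")
```

Comparing only the `accession` field identifies the record as a whole, while comparing `accession` together with `version` (or the gi alone) pins down one specific sequence.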
A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for CDS features in the FEATURES portion of a GenBank entry, e.g. /protein_id="AAA00001.1". Protein sequence translations also receive their own unique gi number, which appears as a second qualifier on the CDS feature, e.g. /db_xref="GI:1233445".
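The protein identifiers can be pulled out of a CDS feature in the same spirit. This is a sketch only, assuming qualifier values are enclosed in straight double quotes as in GenBank flat files; the function name `cds_identifiers` and the sample feature text are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches qualifiers of the form /name="value".
_QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')


def cds_identifiers(feature_text: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (protein Accession.version, protein gi) found in a CDS feature."""
    protein_id: Optional[str] = None
    gi: Optional[int] = None
    for name, value in _QUALIFIER_RE.findall(feature_text):
        if name == "protein_id":
            protein_id = value
        elif name == "db_xref" and value.startswith("GI:"):
            gi = int(value[len("GI:"):])
    return protein_id, gi


if __name__ == "__main__":
    sample = (
        "     CDS             1..300\n"
        '                     /protein_id="AAA00001.1"\n'
        '                     /db_xref="GI:1233445"\n'
    )
    print(cds_identifiers(sample))
```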
Ensuring stable access to sequence data
It is becoming increasingly popular for research groups to share new biological sequences and update existing sequences by directly posting the data on the Web. While this is a convenient and effective way to share the data among a set of collaborators, if original data and updates are not also submitted to a central repository, three significant problems arise: the access lifetime of the data may be reduced, the full biological context of the data may not be realized, and existing data in heavily used centralized databases will become outdated.
The ephemeral nature of much of the content on the web is part of the common experience of web users. In one attempt to quantify content lifetime, 360 randomly selected web pages were tracked for a period of 4 years, and a half-life of only 2 years was measured for the set (9). Although a well-maintained web page can certainly persist for longer than 2 years, the relatively short half-life reported for this set of pages reflects the many factors that can intervene to affect access to web-posted data.
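To give a feel for what a 2-year half-life implies, the sketch below computes the fraction of pages expected to remain accessible after a given time. It assumes simple exponential decay, which is our simplification and is not stated by the cited study; the function names are hypothetical.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction still accessible after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)


def expected_survivors(initial_pages: int, years: float, half_life_years: float) -> float:
    """Expected number of pages still accessible under the same assumption."""
    return initial_pages * surviving_fraction(years, half_life_years)


if __name__ == "__main__":
    # 360 pages followed for 4 years with a 2-year half-life:
    # two half-lives elapse, so one quarter would be expected to remain.
    print(expected_survivors(360, years=4, half_life_years=2))  # 90.0
```

Under this assumption, only about a quarter of the tracked pages would still be reachable at the end of the 4-year period.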
Even during the accessible lifetime of web-posted sequence data, however, the full biological context of a sequence may not be realized if the sequence cannot be conveniently compared with others—perhaps derived from distantly related organisms that are beyond the scope of the host web page.
In addition, if updates to sequences contained within centralized databases are made to a web page, but not also made to corresponding records in the central database, the newer data will not reach the wider research community and much of the impact of the data will be lost.
Submission of sequence data to a centralized repository such as GenBank solves these three problems. Researchers are assured stable access to the data via versioned bimonthly releases available by FTP, NCBI-maintained and numerous third-party interfaces to a uniform dataset, and the archival redundancy offered by the tripartite International Nucleotide Sequence Databases collaboration. Combining new data with that of other researchers worldwide within a central database provides a broad biological context that stimulates discovery, and keeping each sequence current magnifies the utility of all the sequences in the database.