The data in GenBank and the collaborating databases, EMBL and DDBJ, are submitted either by individual authors to one of the three databases or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources.
Direct electronic submission
Virtually all records enter GenBank as direct electronic submissions (www.ncbi.nlm.nih.gov/Genbank/), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public sequence database as a condition of publication. GenBank staff can usually assign an accession number to a sequence submission within two working days of receipt, and do so at a rate of approximately 3500 per day. The accession number serves as confirmation that the sequence has been submitted and provides a means for readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database.
Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that the deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitter is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at email@example.com.
NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program tbl2asn, described at www.ncbi.nlm.nih.gov/Sequin/table.html.
Submission using BankIt
About a third of author submissions are received through an NCBI Web-based data submission tool named BankIt (see ‘Recent Developments’ section). Using BankIt, authors enter sequence information directly into a form and add biological annotation such as coding regions or mRNA features. Text boxes and pull-down menus allow the submitter to describe the sequence further without having to learn formatting rules or controlled vocabularies. Additionally, BankIt now allows submitters to upload source and annotation using tab-delimited tables. Before creating a draft record in the GenBank flat file format for the submitter to review, BankIt validates the submissions by flagging many common errors and checking for vector contamination using a variant of BLAST called Vecscreen.
Submission using Sequin and tbl2asn
NCBI also offers a standalone multi-platform submission program called Sequin (www.ncbi.nlm.nih.gov/projects/Sequin/) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences, such as a single cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations from simple tables. The most recent version, Sequin 10.0, was released in April 2010 and is available for Macintosh, PC and Unix computers via anonymous FTP at ftp.ncbi.nih.gov/sequin. Once a submission is completed, submitters can e-mail the Sequin file to firstname.lastname@example.org. Submitters of large, heavily annotated genomes may find it convenient to use tbl2asn to convert a table of annotations generated from an annotation pipeline into an ASN.1 (Abstract Syntax Notation One) record suitable for submission to GenBank.
Submission of Barcode sequences
The Consortium for the Barcode of Life (CBOL) is an international initiative to develop DNA barcoding as a tool for characterizing species of organisms using a short DNA sequence. For animal species, a 648-bp fragment of the gene for cytochrome oxidase subunit I is used as the barcode. The plant and fungal communities are investigating other loci. NCBI, in collaboration with CBOL (www.barcoding.si.edu/), has created an online tool (BarSTool) for the bulk submission of barcode sequences to GenBank (www.ncbi.nlm.nih.gov/WebSub/?tool=barcode) that allows users to upload files containing a batch of sequences with associated source information. The Nucleotide query ‘barcode[keyword]’ retrieves the almost 200 000 barcode sequences in GenBank, over 160 000 of which were added in the last year.
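The query above can also be issued programmatically. The following is a minimal sketch that only builds the request URL; it assumes the NCBI E-utilities ESearch endpoint, which is not described in this article, and the helper name build_esearch_url and its parameters are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not mentioned in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", retmax=0):
    """Return an ESearch URL for an Entrez query term.

    retmax=0 asks for the hit count only, without a list of record IDs.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes the square brackets of the field tag.
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The query string quoted in the text, sent against the Nucleotide database.
barcode_url = build_esearch_url("barcode[keyword]")
print(barcode_url)
```

Fetching this URL should return an XML document that includes the number of matching records; the count will differ from the figures quoted above as the database grows.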
Notes on particular divisions
Transcriptome Shotgun Assembly (TSA) sequences
The TSA division contains transcriptome shotgun assembly sequences that are assembled from sequences deposited in the NCBI Trace Archive, the Sequence Read Archive (SRA) and the EST division of GenBank. The TSA division has grown dramatically in the past year in response to the over 40 Terabasepairs of data deposited into SRA in the same period from next-generation sequencing technologies, including those from Roche-454 Life Sciences, Illumina Solexa and Applied Biosystems SOLiD. Neither the Trace Archive nor SRA is part of GenBank; both are described elsewhere (4). TSA records (e.g. EZ000001) have ‘TSA’ as their keyword and a Primary block that provides the base ranges and identifiers of the sequences used in the TSA assembly.
Environmental sample sequences (ENV)
The ENV division of GenBank accommodates non-WGS sequences obtained via environmental sampling methods in which the source organism is unknown. Many ENV sequences arise from metagenome samples derived from microbiota in various animal tissues, such as within the gut or skin, or from particular environments, such as freshwater sediment, hot springs or areas of mine drainage. Records in the ENV division contain ‘ENV’ in the keyword field and use an ‘/environmental_sample’ qualifier in the source feature.
Whole genome shotgun sequences
WGS sequences appear in GenBank as sets of WGS sequence overlap contigs, each of which is issued an accession number consisting of a four-letter project ID, followed by a two-digit version number and a six-digit contig ID. Hence, the WGS accession number ‘AAAA01072744’ is assigned to contig number ‘072744’ of the first version of project ‘AAAA’. WGS sequencing projects have contributed over 64 million contigs to GenBank, and these primary sequences have been used to construct 8 million large-scale assemblies of scaffolds and chromosomes. For a complete list of WGS projects with links to the data, see www.ncbi.nlm.nih.gov/Traces/wgs/
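The accession layout described above (four-letter project ID, two-digit version, six-digit contig ID) can be decomposed mechanically. The following is a minimal sketch; the names WgsAccession and parse_wgs_accession are hypothetical, and it covers only the 12-character form given in the text.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    project: str   # four-letter project ID
    version: int   # two-digit version number
    contig: str    # six-digit contig ID, kept as text to preserve leading zeros


# Four letters, two digits, six digits, as described in the text.
_WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Split a WGS accession number into project, version and contig ID."""
    match = _WGS_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS accession of the expected form: {accession!r}")
    project, version, contig = match.groups()
    return WgsAccession(project, int(version), contig)


parsed = parse_wgs_accession("AAAA01072744")
print(parsed.project, parsed.version, parsed.contig)
```

For the example from the text, this yields project ‘AAAA’, version 1 and contig ‘072744’.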
Although WGS project sequences may be annotated, many low-coverage genome projects do not contain annotation. Because these sequence projects are ongoing and incomplete, these annotations may not be tracked from one assembly version to the next and should be considered preliminary. Submitters of genomic sequences, including WGS sequences, are urged to use evidence tags of the form ‘/experimental=text’ and ‘/inference=TYPE:text’, where TYPE is one of a number of standard inference types and text consists of structured text.
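The two evidence tag forms can be separated into their parts with a few lines of code. The following is a minimal sketch that follows only the ‘/experimental=text’ and ‘/inference=TYPE:text’ forms given above; the function name parse_evidence_qualifier is hypothetical, and the example values are illustrative rather than taken from a specific record.

```python
def parse_evidence_qualifier(qualifier):
    """Split an evidence tag into its kind, inference type (if any) and text."""
    if not qualifier.startswith("/") or "=" not in qualifier:
        raise ValueError(f"not a qualifier: {qualifier!r}")
    key, value = qualifier[1:].split("=", 1)
    value = value.strip().strip('"')  # tolerate optional surrounding quotes
    if key == "experimental":
        return {"kind": key, "type": None, "text": value}
    if key == "inference":
        # Everything before the first colon is TYPE; the rest is structured text.
        inference_type, sep, text = value.partition(":")
        if not sep:
            raise ValueError(f"inference tag lacks a TYPE:text pair: {qualifier!r}")
        return {"kind": key, "type": inference_type, "text": text}
    raise ValueError(f"not an evidence tag: {qualifier!r}")


# Illustrative values only.
tag = parse_evidence_qualifier("/inference=ab initio prediction:Genscan:2.0")
print(tag["type"], "|", tag["text"])
```

Splitting on the first colon only keeps any further colons inside the structured text intact.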
Expressed sequence tags (ESTs)
ESTs continue to be a major source of data for gene expression and annotation studies, and at almost 37 billion base pairs, the EST division remains the largest non-WGS division in GenBank. EST data are available for download from ftp.ncbi.nih.gov/repository/dbEST/ as well as from the GenBank FTP site. The data in dbEST are clustered using the BLAST programs to produce the UniGene database (www.ncbi.nlm.nih.gov/unigene) of more than 4.3 million gene-oriented sequence clusters representing over 120 organisms (4).
High-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences
The HTG division of GenBank (www.ncbi.nlm.nih.gov/HTGS/) contains unfinished large-scale genomic records, which are in transition to a finished state (7). These records are designated as belonging to Phases 0–3 depending on the quality of the data, with Phase 3 being the finished state. Upon reaching Phase 3, HTG records are moved into the appropriate organism division of GenBank.
The HTC division of GenBank contains high-throughput cDNA sequences that are of draft quality but may contain 5′ UTRs, 3′ UTRs, partial coding regions and introns. HTC sequences which are finished and of high quality are moved to the appropriate organism division of GenBank. A project generating HTC data is described in ref. (8).
Special record types
Third party annotation
Third Party Annotation (TPA) records are sequence annotations published by someone other than the original submitter of the primary sequence record in DDBJ/EMBL/GenBank (www.ncbi.nlm.nih.gov/Genbank/TPA.html). TPA records fall into one of three categories: experimental, in which case there is direct experimental evidence for the existence of the annotated molecule; inferential, in which case the experimental evidence is indirect; and reassembly, where the focus is on providing a better assembly of the raw reads. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA_exp:’, ‘TPA_inf:’ or ‘TPA_reasm:’ at the beginning of each Definition Line as well as corresponding keywords. TPA experimental and inferential records also contain a Primary block similar to that in a TSA record. Currently GenBank contains over 5.6 million TPA records, >99% of which are derived from a recent submission of an individual human genome. TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. TPA submissions to GenBank may be made using either BankIt or Sequin.
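Because the category label sits at the start of the Definition Line, a TPA record can be classified from that line alone. The following is a minimal sketch; the function name tpa_category is hypothetical, and the definition lines in the example are invented for illustration.

```python
# Labels and categories as listed in the text.
TPA_PREFIXES = {
    "TPA_exp:": "experimental",
    "TPA_inf:": "inferential",
    "TPA_reasm:": "reassembly",
}


def tpa_category(definition_line):
    """Return the TPA category named by a Definition Line, or None if absent."""
    text = definition_line.lstrip()
    for prefix, category in TPA_PREFIXES.items():
        if text.startswith(prefix):
            return category
    return None


# Invented definition lines, for illustration only.
examples = [
    "TPA_exp: Hypothetical organism example gene, complete cds",
    "TPA_inf: Hypothetical organism example mRNA",
    "Hypothetical organism ordinary record",
]
for line in examples:
    print(tpa_category(line), "<-", line)
```

Records without one of the three labels return None, which keeps conventional GenBank records distinguishable from TPA records.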
Contig (CON) records for assemblies of smaller records
Small genomes, such as those from bacteria, can generally be conveniently represented and analyzed as single sequences. For very long sequences, such as a eukaryotic chromosome, where the sequence is not complete but consists of several contig records with uncharacterized gaps between them, the entire chromosome is represented in GenBank as a CON record. Rather than listing the sequence itself, CON records contain assembly instructions involving the several component sequences. An example of such a CON record is DP000010 for rice chromosome 11.
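To illustrate what assembly instructions can look like, the following is a minimal sketch that reads a join(...) statement listing component ranges and gaps and reports the assembled length. The join/gap syntax is an assumption about the flat file representation and is not described in this article; the accessions in the example are invented, and the sketch ignores other constructs such as reverse-complemented components.

```python
import re

# Assumed component form: accession.version:start..end
_COMPONENT_RE = re.compile(r"^([A-Z0-9_]+\.\d+):(\d+)\.\.(\d+)$")
# Assumed gap form: gap(100) or gap(unk100)
_GAP_RE = re.compile(r"^gap\((?:unk)?(\d+)\)$")


def parse_join(instruction):
    """Return a list of ('component', accession, start, end) and ('gap', length)."""
    text = "".join(instruction.split())  # drop whitespace and line wraps
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) statement")
    parts = []
    for item in text[len("join("):-1].split(","):
        component = _COMPONENT_RE.match(item)
        gap = _GAP_RE.match(item)
        if component:
            accession, start, end = component.groups()
            parts.append(("component", accession, int(start), int(end)))
        elif gap:
            parts.append(("gap", int(gap.group(1))))
        else:
            raise ValueError(f"unsupported item: {item!r}")
    return parts


def assembled_length(parts):
    """Sum component spans (inclusive coordinates) and gap lengths."""
    total = 0
    for part in parts:
        if part[0] == "component":
            total += part[3] - part[2] + 1
        else:
            total += part[1]
    return total


# Invented accessions, for illustration only.
parts = parse_join("join(XX000001.1:1..5000,gap(100),XX000002.1:1..7200)")
print(assembled_length(parts))  # 5000 + 100 + 7200 = 12300
```

Keeping the instructions as a list of components and gaps mirrors the point made above: the CON record stores how to build the sequence, not the sequence itself.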