The SRA follows the established INSDC data-exchange convention where public data are exchanged between the INSDC partners on a daily basis. This allows all public data to be accessed at each site regardless of the point of the original submission.
Before next-generation sequencing platforms existed, the most commonly used format for representing base calls and quality scores was the Sanger Fastq format (6). In 2001, a new community format was created that also supported the inclusion of signal information: the ZTR format (7). SRF, a further development of ZTR, became the first widely used cross-platform format for storing next-generation sequence data. The SRF format gained a substantial user base among Illumina GA and SOLiD™ users, while the earlier SFF format (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff) became the standard for the 454 platform. In 2009, SAM and BAM (8) were introduced as generic formats for storing read alignments against reference sequences. Sequence alignments are increasingly generated as a primary analysis intermediate, and BAM is expected to replace SRF as the preferred submission format to the SRA. Importantly, BAM supports not only aligned but also unaligned reads, and submitting the unaligned reads to the SRA is likewise recommended. The SRA archives are currently working with community experts to define an archival BAM format, with the goal of making submission and exchange of BAM files as easy as possible.
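The point that aligned and unaligned reads can share one file can be illustrated with a minimal sketch. It assumes the tab-separated text layout of SAM (BAM being its binary counterpart), in which bit 0x4 of the FLAG field marks an unmapped read, and the Sanger convention of encoding quality scores as ASCII characters offset by 33. The function names and the two example records are hypothetical and are not part of any SRA tool.

```python
# Illustrative sketch only: helper names and example records are hypothetical.
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields of interest."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return {
        "qname": fields[0],      # read name
        "flag": int(fields[1]),  # bitwise FLAG
        "rname": fields[2],      # reference name, '*' if unavailable
        "pos": int(fields[3]),   # 1-based position, 0 if unavailable
        "seq": fields[9],        # base calls
        "qual": fields[10],      # quality string (ASCII, Phred+33)
    }


def is_unaligned(record):
    """Return True when the unmapped bit is set in the FLAG field."""
    return bool(record["flag"] & UNMAPPED_FLAG)


def phred33(qual):
    """Decode a Sanger-style quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in qual]


# One aligned and one unaligned record, as they might appear in the same file.
aligned = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
unaligned = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tII5!"

for line in (aligned, unaligned):
    rec = parse_sam_line(line)
    status = "unaligned" if is_unaligned(rec) else "aligned"
    print(rec["qname"], status, phred33(rec["qual"]))
```

Both records carry the same base calls and quality scores; only the FLAG, reference name, position and CIGAR fields differ, which is why a single container can serve for either kind of read.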
Efficient storage and compression of next-generation sequence data has always been one of the main objectives of the SRA. Internally, the SRA uses the NCBI SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software) for storing and exchanging all next-generation sequence data. Critically, the toolkit contains a configurable storage and compression architecture that allows current best practices to co-exist with future ones. The NCBI SRA Toolkit has established itself as an important part of SRA operations at NCBI, EBI and DDBJ, which now routinely validate submitted data and convert them into the SRA Toolkit format. This format is used for data exchange by the SRA partners, converted to other formats such as Fastq, and made available to other applications through its standard API. For example, NCBI BLAST has been extended to run sequence similarity searches on the files generated by the NCBI SRA Toolkit.