With the development of next-generation sequencing (NGS) technology, researchers have had to adapt quickly to cope with the vast increase in raw data. Experiments that would previously have been conducted with microarrays, resulting in several megabytes of data, are now performed by sequencing, producing many gigabytes and demanding a significant investment in computational infrastructure. While the cost of disk storage has also steadily decreased over time, it has not matched the dramatic change in the cost and volume of sequencing. A transformative breakthrough in storage technology may occur in the coming years, but the era of the $1000 genome is certain to arrive before that of the $100 petabyte hard disk.
As cloud computing and software as a service become increasingly relevant to molecular biology research, hours spent transferring NGS data sets to and from off-site servers for analysis will delay meaningful results. More often researchers will be forced to maximize bandwidth by physically transporting storage media (via the ‘sneakernet’), an expensive and logistically complicated option. These difficulties will only be amplified as exponentially more sequencing data are generated, implying that even moderate gains in domain-specific compression methods will translate into a significant reduction in the cost of managing these massive data sets over time.
Storage and analysis of NGS data center primarily around two formats that have recently arisen as de facto standards: FASTQ and SAM. In addition to nucleotide sequences, FASTQ stores a unique identifier for each read and quality scores, which encode estimates of the probability that each base is correctly called. For all its simplicity, FASTQ is a surprisingly ill-defined format. The closest thing to an accepted specification is the description by Cock et al. (1), but the format arose ad hoc from multiple sources (primarily Sanger and Solexa/Illumina), so a number of variations exist, particularly in how quality scores are encoded. The SAM format is far more complex but also more tightly defined, and comes with a reference implementation in the form of SAMtools (2). It stores alignment information in addition to read identifiers, sequences and quality scores. SAM files, which are stored in plain text, can also be converted to the BAM format, a compressed binary version of SAM that is far more compact and allows relatively efficient random access.
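To make the FASTQ layout and quality encoding concrete, the sketch below parses a single four-line record and decodes Sanger-style (Phred+33) quality characters into error probabilities. The record contents and function names here are illustrative, not part of any specification.

```python
# Illustrative sketch: parsing one FASTQ record and decoding Sanger-style
# (Phred+33) quality characters. Other historical variants (e.g. the
# Solexa/Illumina Phred+64 encodings) differ in offset and scale, which
# is precisely the source of the format variations discussed above.

def parse_fastq_record(lines):
    """Parse a four-line FASTQ record: identifier, sequence, separator, qualities."""
    ident, seq, sep, quals = lines
    assert ident.startswith("@") and sep.startswith("+")
    return ident[1:], seq, quals

def phred33_error_probs(quals):
    """Map each quality character to P(base call is wrong) = 10^(-Q/10)."""
    return [10 ** (-(ord(c) - 33) / 10) for c in quals]

record = ["@read1", "GATTACA", "+", "IIIIIII"]
name, seq, quals = parse_fastq_record(record)
probs = phred33_error_probs(quals)
# 'I' encodes Q = 40, i.e. an error probability of 1e-4
```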
Compression of nucleotide sequences has been the target of some interest, but compressing NGS data, made up of millions of short fragments of a greater whole and combined with metadata in the form of read identifiers and quality scores, presents a very different problem and demands new techniques. Splitting the data into separate contexts for read identifiers, sequences and quality scores, and compressing each with the Lempel–Ziv algorithm and Huffman coding, was explored by Tembe et al. (3) and Deorowicz and Grabowski (4), who demonstrate the promise of domain-specific compression with significant gains over general-purpose programs such as gzip and bzip2.
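The stream-splitting idea can be sketched in a few lines. This is a simplified stand-in, not the cited implementations: it separates the three fields into independent streams and compresses each with zlib, whose DEFLATE algorithm combines LZ77 (a Lempel–Ziv variant) with Huffman coding.

```python
# Simplified sketch of context splitting (not the cited tools' actual coders):
# identifiers, sequences and quality scores go into three separate streams,
# each compressed independently so that the coder can adapt to the very
# different statistics of each field.
import zlib

def split_and_compress(records):
    """records: iterable of (identifier, sequence, qualities) tuples."""
    ids, seqs, quals = zip(*records)
    streams = ["\n".join(s).encode() for s in (ids, seqs, quals)]
    return [zlib.compress(s, 9) for s in streams]

def decompress_streams(streams):
    return [zlib.decompress(s).decode().split("\n") for s in streams]

records = [("read1", "GATTACA", "IIIIIII"),
           ("read2", "GATTACC", "IIIIIIH")]
round_trip = decompress_streams(split_and_compress(records))
# round_trip recovers the identifier, sequence and quality streams losslessly
```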
Kozanitis et al. (5) and Hsi-Yang Fritz et al. (6) proposed reference-based compression methods, exploiting the redundant nature of the data by aligning reads to a known reference genome sequence and storing genomic positions in place of nucleotide sequences. Decompression is then performed by copying the read sequences from the genome. Though any differences from the reference sequence must also be stored, reference-based approaches can achieve much higher compression, and they grow increasingly efficient with longer read lengths, since storing a genomic position requires the same amount of space regardless of the length of the read.
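A minimal sketch of this encoding, with names of our own invention rather than those of any cited tool: an aligned read is reduced to a (position, length, differences) tuple, and decompression copies from the reference and reapplies the stored mismatches.

```python
# Hypothetical sketch of reference-based encoding (illustrative names, not
# any cited tool's format). The stored tuple grows with the number of
# mismatches, not with read length, which is why longer reads compress
# proportionally better.

def encode_read(read, reference, position):
    """Store only the position, length and mismatches of an aligned read."""
    diffs = [(i, base) for i, base in enumerate(read)
             if reference[position + i] != base]
    return (position, len(read), diffs)

def decode_read(encoded, reference):
    """Copy the read from the reference, then reapply stored mismatches."""
    position, length, diffs = encoded
    bases = list(reference[position:position + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)

reference = "ACGTACGTACGT"
read = "CGTTC"  # aligns at position 1 with a single mismatch
enc = encode_read(read, reference, 1)
# enc == (1, 5, [(3, 'T')]); decode_read(enc, reference) recovers the read
```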
This idea is also explored in the Goby format (http://campagnelab.org/software/goby/), which has been proposed as an alternative to SAM/BAM; the primary functional difference is that the sequences of aligned reads are not stored but are looked up in a reference genome when needed (frequently they are not). For some applications, reference-based compression can be taken much further by storing only single nucleotide polymorphism (SNP) information, summarizing a sequencing experiment in several megabytes (7). However, even when SNP calls are all that is needed, discarding the raw reads would prevent any reanalysis of the data.
While a reference-based approach typically results in superior compression, it has a number of disadvantages. Most evidently, an appropriate reference sequence database is not always available, particularly in the case of metagenomic sequencing. One could be contrived by compiling a set of genomes from species expected to be represented in the sample, but a high degree of expertise is required to curate and maintain such a project-dependent database. Second, there is the practical concern that files compressed with a reference-based approach are not self-contained: decompression requires precisely the same reference database used for compression, and if that database is lost or forgotten, the compressed data become inaccessible.
Another recurring theme in the growing literature on short-read compression is lossy encoding of sequence quality scores. This follows naturally from the realization that quality scores are particularly difficult to compress: unlike read identifiers, which are highly redundant, or nucleotide sequences, which contain some structure, quality scores are inconsistently encoded across protocols and computational pipelines and are often simply high-entropy. It is dissatisfying that metadata (quality scores) should consume more space than primary data (nucleotide sequences). Yet also dissatisfying to many researchers is the thought of discarding information without a good understanding of its effect on downstream analysis.
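The entropy gap between the two fields is easy to demonstrate on synthetic data (the streams below are fabricated for illustration): a four-letter nucleotide alphabet costs at most 2 bits per symbol, while a quality stream drawing on ~40 distinct scores can approach log2(40) ≈ 5.3 bits per symbol.

```python
# Why quality scores dominate compressed size: Shannon entropy (bits/symbol)
# of a synthetic four-letter nucleotide stream vs. a synthetic quality
# stream cycling through 40 distinct Phred scores.
import math
from collections import Counter

def entropy(s):
    """Empirical Shannon entropy of a string, in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

seq = "ACGT" * 250                                        # 4-symbol alphabet
quals = "".join(chr(33 + i % 40) for i in range(1000))    # 40-symbol alphabet
# entropy(seq) == 2.0; entropy(quals) == log2(40), about 5.32
```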
A number of lossy compression algorithms for quality scores have been proposed, including various binning schemes implemented in QScores-Archiver (8) and SCALCE (http://scalce.sourceforge.net), scaling to a reduced alphabet with randomized rounding in SlimGene (5), and discarding quality scores for bases that match a reference sequence in Cramtools (6). In SCALCE, SlimGene and Cramtools, quality scores may also be losslessly compressed. Kozanitis et al. (5) analyzed the effects of their algorithm on downstream analysis. Their results suggest that while some SNP calls are affected, the affected calls are primarily marginal, low-confidence calls between hetero- and homozygosity.
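Binning, the simplest of these schemes, can be sketched as follows. The bin boundaries below are hypothetical, not those of QScores-Archiver or SCALCE: every score is replaced by a representative value for its bin, shrinking the alphabet and hence the entropy a downstream coder must pay for.

```python
# Illustrative quality-score binning (hypothetical bin boundaries, not any
# published tool's): each Phred score maps to its bin's representative
# value, reducing a ~40-symbol alphabet to 4 symbols at the cost of
# precision in the stored scores.

BINS = [(0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 41, 37)]  # (lo, hi, representative)

def bin_quality(q):
    """Replace a Phred score with its bin's representative value."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    return q  # out-of-range scores pass through unchanged

def bin_quality_string(quals, offset=33):
    """Apply binning to a Phred+33-encoded quality string."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in quals)
```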
Decreasing the entropy of quality scores while retaining accuracy is an important goal, but successful lossy compression demands an understanding of what is lost. For example, lossy audio compression (e.g. MP3) is grounded in psychoacoustic principles, preferentially discarding the least perceptible sound. Devising a similarly principled method for NGS quality scores is difficult, given that both the algorithms that generate them and the algorithms informed by them are moving targets. In the analysis by Kozanitis et al. (5), the authors are appropriately cautious in interpreting their results, pointing out that 'there are dozens of downstream applications and much work needs to be done to ensure that coarsely quantized quality values will be acceptable for users.'
In the following sections, we describe and evaluate Quip, a new lossless compression algorithm that leverages a variety of techniques to achieve very high compression of sequencing data of many types, yet remains efficient and practical. We have implemented this approach in an open-source tool capable of compressing both SAM/BAM and FASTQ files, retaining all information from the original file.