The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework for automatically generating software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.
A few typical strategies have been employed so far to implement compressed storage for genomic data:
Solutions belonging to the first category are usually relatively straightforward to implement and optimize, and sufficient for archival purposes. However, they are unsuitable for applications requiring more sophisticated or volatile derived data models, such as ‘on-the-fly’ intermediate schemes produced by data analysis workflows that do not necessarily correspond to an available long-term storage format. Solutions in the second category automatically inherit many advantages of the framework they build upon, mainly high-level features and seamless integration with other tools. However, they also constrain the user within the boundaries of a predefined, and possibly cumbersome, computational ecosystem, and often turn out to be suboptimal in terms of either storage usage or CPU time consumption.
Here, we present a different approach, illustrated in Figure 1, that focuses on flexibility and immediate deployability in any local high-performance computing setup. It is based on a few key concepts:
In this section, we describe in more detail how the major features provided by CARGO are implemented.
In order to illustrate how the CARGO meta-language is translated to C++ code, we will use a simple proof-of-concept FASTQ record definition (see Figure 3A). More realistic worked-out examples for both the FASTQ and SAM formats, together with the full definition of the CARGO meta-language in Backus–Naur form, can be found in Supplementary Documentation Section 4.
According to the workflow depicted in Figure 1, the CARGO meta-language record definition (as in Figure 3A) will need to be processed with the CARGO tools. This operation will generate several files containing C++ code. The most relevant ones are:
All the C++ files mentioned above are automatically deduced from the provided CARGO meta-language type definition. In particular, both the C++ record type definition and the C++ TypeAPI-based record description can be emitted without any user intervention. This is possible thanks to the CARGO translator application (see Supplementary Documentation for a full description of the program and its options): it parses programs written in the CARGO meta-language, builds an internal representation of each type as a set of annotated abstract syntax trees (ASTs; one per @record directive present in the meta-code, as in Figure 3A) and turns each AST into valid C++ code. Such code relies upon the TypeAPI classes to interface with the CARGO library and thus provide access to compressed containers. However, one should note that the TypeAPI is designed as an implementation layer automatically invoked by the CARGO meta-language translator, not as something whose details should be directly exposed to the user. Most users will never need to look into the TypeAPI-based record description, unless they need to perform some very low-level optimization. For instance, none of the CARGO compressors described in this article required any modification to the machine-generated TypeAPI code.
Even a very simple piece of CARGO meta-code like that of Figure 3A is able to produce optimized compressors/decompressors that rival the current state of the art in performance (see Supplementary Data Section 1.3). However, more complex file formats and more sophisticated data transformation strategies (such as those relying upon reference-based sequence compression) are likely to require more fine-tuned CARGO meta-language record definitions. A fully fledged record type definition, suitable for implementing the high-performance SAM compressor presented in Figure 5, methods 1, 2 and 7, is shown in Figure 2.
As illustrated in Figure 4, containers are made of three different storage areas:
In order to provide for easier backup and recovery, in our current implementation each storage area corresponds to a separate file.
A key design feature of streams is that the blocks composing them do not need to be consecutive. As a result, many different interleaved streams can coexist in the same container, and, provided that a suitable locking mechanism is put in place, many processes can read and write to the same container concurrently.
Overall, our container design is very flexible, in that it allows for an easy implementation of many high-level features. For instance, when the amount of space needed is not known in advance, a container can be generated with an arbitrary size; once all the datasets of interest have been stored in it, the container can then be shrunk to remove unused blocks and facilitate data archival/exchange (see Supplementary Data Section 3.2.2). Although not yet implemented, other possible features would be block encryption and resilience to errors. The latter would work in a way similar to what is offered by filesystem recovery programs: it would rely on the presence of checksum information within the blocks to detect possible corruption due to hardware errors occurring in the physical storage layer, and would allow the reconstruction of the whole dataset except for the records directly depending on the affected block.
As each stream is saved as an independent list of blocks, after a record has been parsed and optionally transformed using a user-specified adaptor function, each block can be compressed independently (and the same holds, in reverse order, for decompression). Considering that record parsing and transformation are usually not the data-processing bottlenecks, this scheme leaves ample room for implementing transparent parallelization within the CARGO library. In our current implementation, we follow a standard producer/consumer paradigm and dispatch a small slice of the input records to each processing thread. After streams have been parallelized in this way, I/O throughput usually becomes the performance bottleneck. This is why selecting large block sizes can improve performance, especially on the network-distributed filesystems usually found in high-performance computing environments.
In addition to the custom record transformation functionality, another important feature offered by CARGO is the possibility of specifying an arbitrary custom sorting function defined in terms of the record, and then performing range queries on the resulting key. In this case, streams contain additional information stored in the dataset area, consisting of keys sampled at regular intervals. This allows us to locate records by simple bisection methods. Once the blocks within the query range have been identified, the subsequent data extraction can be easily parallelized.
In Supplementary Documentation Section 8, we demonstrate in a step-by-step tutorial how families of compressors/decompressors for formats widely used in high-throughput sequencing bioinformatics, such as FASTQ (9) and SAM, can be implemented with CARGO. In fact, several implementations of increasing complexity and sophistication are included in the standard CARGO distribution. Our full results are presented in Supplementary Data Section 1; here we focus on two benchmarks testing the recompression of SAM files, although similar results hold true for FASTQ files. In general, our framework provides unprecedented flexibility, excellent performance and good scalability.
Figure 5 illustrates our results when recompressing an 82 GB SAM file with different methods (dataset HG01880 from the 1000 Genomes Project (10); see Supplementary Data for a complete description of all methods).
Perhaps the most striking feature of our approach is that, in general, very little code is required to achieve results that are comparable to, or better than, what state-of-the-art compressors can obtain. In fact, about 30 lines of CARGO meta-code supplemented with fewer than 90 lines of C++ code on top of our framework are sufficient to implement a SAM compressor achieving compression levels similar to those offered by the recently published DeeZ (5), while being several times faster at both compression and decompression (Figure 5, methods 4 and 12 versus methods 6, 8 and 9).
With some more code (70 lines of CARGO meta-code and about 2000 lines of C++ code) one can implement a fully fledged SAM compressor offering advanced features like reference-based sequence compression and optional lossy Illumina 8-level quality rebinning; both the compression levels and the (de)compression speed are on par with those provided by a recent version of the HTSLib-based (11) CRAM (12) implementation (Figure 5, method 2 versus 3 and method 7 versus 10). In addition, by changing a few lines in the meta-code description we can easily generate a family of compressors suitable for different scenarios (for example, slower compression but faster decompression; or both slower compression and decompression but a smaller final size).
Our best lossy quality resampling scheme uses the same compression setup as CRAM, additionally stripping out read names and optional SAM fields as in the initial implementation of CRAM; it produces archives that are about half the size of the corresponding CRAM ones, and is able to compress/decompress faster than CRAM (Figure 5, method 1 versus 3).
Of note, although they perform the same function, our family of compressors and those developed so far to operate on fixed-format files differ in a number of fundamental philosophical points. In our case, good performance stems not from highly optimized ad hoc code, but as a straightforward by-product of simple design choices, in particular:
In addition, while, say, a SAM file compressed to CRAM needs a specific tool for the semantics of its content to be successfully recovered, the same data compressed within our framework does not. For instance, we fully parse the SAM record including its optional fields, and every component of the data structure is subsequently available to the (de)compressing programs on the C++ side of the CARGO implementation. It would be perfectly possible to output a different format (for instance, one including fewer or additional fields) starting from the same database.
In addition, our approach easily scales up to multi-TB datasets. Figure 6 illustrates the results of an experiment whereby we compressed a large collection of SAM volumes from the 1000 Genomes Project (17 TB uncompressed in total; see Supplementary Data Section 2 for the complete list of archives) into:
The CARGO scheme was selected as a reference because it is directly comparable with BAM or CRAM, provided that suitable parameters are selected for those methods (see Supplementary Data for a full list). CARGO streams were stored sorted by genomic position, in order to make the archive searchable and mimic the search-by-position capabilities offered by BAM/CRAM formats.
In line with the results presented in the previous section for smaller files, in this case too our best method achieved a compression rate almost twice that of BAM, and significantly better than the one provided by CRAM. Nevertheless, and depending on the size of the range queried, the time needed to query a CARGO container is either slightly worse than, or comparable to, the one needed to query the much less compressed BAM container; and it is much better than the querying time for a CRAM container (see Supplementary Data Section 1.2). Finally, the data throughput obtained by CARGO is several times higher than what the SAMtools implementations of either the BAM or the CRAM format can offer at the moment (see Supplementary Data Sections 1.1 and 1.2), making CARGO an ideal storage tool for high-performance downstream applications.
In general, what our approach can achieve goes far beyond the compression of formats like FASTQ or SAM, offering many advantages with respect to most solutions currently available in the field of genomics:
As our results stem from a few lines of application-specific code, they are just a glimpse of what CARGO can offer to the field of high-throughput genomics. Bioinformatics has been traditionally plagued by inflexible file formats that constrain effective data exchange. Hopefully our work will demonstrate to the community that using a different, more flexible and high-level approach does not necessarily imply having to forgo the potential benefits—like suitability for high-performance computation and optimization—offered by more traditional solutions.
The authors would like to thank Sebastian Deorowicz, Roderic Guigó, Simon Heath and Ivo Gut for supporting the project, Leonor Frias-Moya and Santiago Marco-Sola for fruitful discussions and technical insights, and the anonymous reviewers for their helpful comments and feedback.
Supplementary Data are available at NAR Online.
Biotechnology and Biological Sciences Research Council (BBSRC) [grant BBS/E/I/00001942 to P.R.]; European Union's Seventh Framework Programme (FP7/2007-2013) [grant agreement No. 305444 (RD-Connect) to Ł.R.]; European Union's European Social Fund [INTERKADRA project UDAPOKL-04.01.01-00-014/10-00 to Ł.R.]. Funding for open access charge: CNAG (own funds) and BBSRC (grant for article processing charges to The Pirbright Institute).
Conflict of interest statement. None declared.