|Home | About | Journals | Submit | Contact Us | Français|
Summary: We present a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. We show that retrieving data from this format is more than 2900 times faster than a naive approach using wiggle files.
Availability and Implementation: Reference implementation in Python and C components available at http://noble.gs.washington.edu/proj/genomedata/ under the GNU General Public License.
The advent of functional genomics assays based on next-generation sequencing (Brunner et al., 2009; Hesselberth et al., 2009; Park, 2009; Wold and Myers, 2008) finally allows the high-throughput acquisition of data at 1-bp resolution across entire genomes. Processing this information, however, provides a challenge for several orders of magnitude beyond that of previous genomic analyses and demands new techniques for efficient operation. We introduce the Genomedata format for genome-scale numerical data, which uses an HDF5 (Hierarchical Data Format; http://hdfgroup.org/HDF5/) container for efficient, random access to huge genomic datasets. We also provide a Python interface to this format.
Traditional data interchange formats such as the wiggle (http://genome.ucsc.edu/goldenPath/help/wiggle.html) and bedGraph (http://genome.ucsc.edu/goldenPath/help/bedgraph.html) formats provide excellent means of disseminating genome-wide datasets but suffer from several disadvantages in the repeated processing of this data. Storing numerical data as ASCII text is inefficient and impedes random access to the data. This problem becomes even more apparent when processing the data in scripting languages such as Python and R, which provide high-performance methods for bulk numerical operations on arrays, but no method for reading in data in interchange formats quickly. It is also necessary to validate this data before use, checking that there is exactly one data point per position and that data are not defined outside the boundaries of the underlying sequence. Genomedata provides an intermediate format and off-loads the frustrations of parsing and validating the data from an analysis programmer. It provides the conveniences of an application programming interface for reading a binary file format, akin to the programmatic access to sequence and alignment data provided by BAM (Li et al., 2009) and BioHDF (Mason et al., 2010), while being suited for dense numeric data such as bigWig (Rhead et al., 2010).
In many workflows, Genomedata allows the user to parse, validate and convert the data into a binary format once, eliminating the computational expense of doing this repeatedly. The data are stored as 32-bit IEEE floating point numbers to allow minimal processing when loading into memory. Not a number entries are used where data are missing or unassigned. HDF5 transparently breaks the data into chunks aligned with data columns, so that it minimizes work during loading. Genomedata compresses these chunks when stored on disk to save space, especially when values are repeated within a column, but in a way that still facilitates efficient random access. We also store some metadata in the archive such that simple summary statistics may be accessed quickly.
To ease the memory requirements of subsequent analysis, Genomedata may optionally break chromosomes into ‘supercontigs,’ which avoid the allocation of empty space in the observation matrix at large assembly gaps (by default, >100 000 bp). This is not necessary for efficient performance on disk, but it is convenient for programmers who wish to process the whole genome.
The reference implementation includes several programs for loading data. The software requires Python 2.5.1, HDF5 1.8 and PyTables 2.1.
Genomedata supplies command-line utilities that make it easy to create archives and load data. The genomedata-load command loads the genome sequence and a number of tracks in wiggle, BED or bedGraph formats, and stores metadata that allow one to rapidly calculate summary statistics such as minimum, maximum, mean or SD. The package also contains utilities to complete only parts of the loading process so that one may load tracks for different chromosomes in parallel.
It is easy to access data in a Genomedata archive using the supplied Python interface. A programmer may retrieve a matrix of data by specifying individual coordinate ranges to the Genomedata interface. Alternatively, one can iterate through the entire dataset chromosome by chromosome. Programmers can accomplish tasks such as reporting the average data value in a number of tracks for specified genomic regions easily, allowing a greater focus on more interesting areas of analysis.
Genomedata can quickly load large amounts of data. We measured the time to load a Genomedata archive with the complete human genome sequence (build NCBI36) and from one to 11 ChIP-seq data tracks on a 2.33-GHz Intel Xeon E5345 processor, and performed a linear regression on the timing results with the statistical computing environment R. This yielded a model with the coefficient of determination R2 = 0.98, where loading the sequence and other constant overhead took 5.0 ± 2.5 × 103 s, and each track took an additional 7.5 ± 0.4 × 103 s.
One may retrieve functional genomics data from Genomedata archives much more quickly than the text-based formats commonly used for this data. We measured the time to retrieve data from a whole-genome 1-bp-resolution DNase-seq data track at each of a randomly generated list of genomic positions using a method that accessed the original gzip-compressed wiggle file and two different methods that access a Genomedata archive loaded from that file (Fig. 1). The offline (sequential access) wiggle algorithm first sorts the list and then iterates through the original wiggle files until it finds the specified positions. The offline Genomedata algorithm works in a similar way, but iterates through a Genomedata archive instead. The online (random access) Genomedata algorithm retrieves the data at each position in the random order specified by the list. We repeated this process with nine different list sizes to examine the dependence of retrieval time on the number of positions.
Because the offline algorithms read data sequentially rather than randomly, their run times are mostly independent of the number of genomic positions. After creation of the Genomedata archive, the offline Genomedata algorithm ran 2900 times faster than the comparable offline wiggle approach, suggesting a considerable advantage for the use of Genomedata when repeatedly accessing a dataset. Even when including the one-time cost of creating the archive (4 h), the Genomedata approach still ran 10 times faster, because we wrote the Genomedata track loader in C. The advantage for an online Genomedata approach is even greater when retrieving fewer than ~10 000 positions at once. Genomedata is especially suited for whole-genome, dense datasets, so it has less of a comparative advantage in cases of sparse datasets with data at only a limited number of genomic positions. Genomedata should still perform as well, however, in an absolute sense.
Not only does using Genomedata improve performance, but it also makes programming against this type of data easier, resulting in less boilerplate code for data retrieval. According to SLOCCount (http://www.dwheeler.com/sloccount/), which counts the physical source lines of code in a program, it took 70 source lines of code to implement the wiggle method, while only 44 (37% fewer) to implement the offline Genomedata method and 16 (77% fewer) to implement the online Genomedata method.
Funding: National Institutes of Health (HG004695).
Conflict of Interest: none declared.