It is widely accepted that biomedical data should be machine-readable and web-accessible. Relational database management systems (RDBMS)1,2 have proven highly effective with sequence data, which are string-based, invariant in organization and interpretable without knowledge of the experiments, instruments or algorithms used to gather them. It has proven more difficult to manage data arising from complex biochemical measurements, imaging, flow cytometry and phenotypic assays of cells and tissues. The interpretation of these data, which are often unstructured (e.g., images), depends critically on experimental context, and this context changes frequently. The difficulty in developing satisfactory database solutions for “high-content” data is widely ascribed to insufficient standardization or poor implementation3, but we believe the problem is more fundamental: it reflects the impossibility of fully specifying complex experimental designs a priori. Flexible and creative design is the essence of good experimental science, and because design determines data structure (the number of time points, repeats, conditions and so on), data structures change frequently. To accommodate these changes, database schemas must be reconfigured repeatedly, a complex and time-consuming task. Thus, most experimental data reside in unlinked, loosely annotated spreadsheets that are easily fragmented or lost4,5. When data scope and complexity demand a more capable repository, a new database is often created ad hoc.
Figure 1 Challenges in the management of multidimensional data. (a) The schematic illustrates that experimental design is the key determinant of data dimensionality.
As an illustrative problem in biological data management, we focus here on high-throughput, high-content microscopy6,7. Microscopy presents two distinct data management challenges. One is the sheer size of the data, which can exceed many terabytes per month. The second is the difficulty of working with numerical data extracted by image analysis, which can include many data types with complex relationships to one another (e.g., the boundaries and intensities of cells or compartments, and computed features such as nuclear translocation)8. For example, a typical genome-wide RNAi screen might generate ~7 × 10⁵ images (~1.3 terabytes of data); analysis would increase the size only modestly (by ~100 megabytes), but the number of data entries would increase from ~10⁶ images to >10⁹ features (Supplementary Fig. 1). Conventional spreadsheets and comma-separated value (CSV) files perform poorly with 10⁹ data entities, and relational databases impose the organizational costs described above.
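The scale estimate above can be reproduced with a short back-of-envelope calculation. The per-image file size, cells per image and features per cell used below are illustrative assumptions chosen to match the orders of magnitude quoted in the text, not measured values from a specific screen:

```python
# Back-of-envelope scale estimate for a genome-wide RNAi imaging screen.
# bytes_per_image, cells_per_image and features_per_cell are assumed values.

n_images = 7e5            # ~7 x 10^5 images in a genome-wide screen
bytes_per_image = 1.9e6   # assumed ~1.9 MB per raw image

raw_terabytes = n_images * bytes_per_image / 1e12
print(f"raw image data: ~{raw_terabytes:.1f} TB")       # ~1.3 TB

cells_per_image = 100     # assumed average number of segmented cells per image
features_per_cell = 15    # assumed measurements extracted per cell

n_features = n_images * cells_per_image * features_per_cell
print(f"feature entries: {n_features:.2e}")             # 1.05e+09, i.e., >10^9
```

The point is that image analysis adds little to the storage footprint but inflates the number of individually addressable data entities by roughly three orders of magnitude.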
In this paper we propose a potential solution to the challenge of managing high-dimensionality biomedical data, based on semantically typed data hypercubes (SDCubes), in which binary data are stored in Hierarchical Data Format 5 (HDF5; http://www.hdfgroup.org/HDF5/) and metadata and data ontologies are stored in Extensible Markup Language (XML; http://www.w3.org/standards/xml). We have created a new open-source Java library, the SDCube Programming Library (Supplementary Software 1), that can create SDCubes with appropriate dimensionality, encode the data model in a machine-readable XML ontology, and reformat SDCubes as needed when experiments change. To illustrate the use of SDCubes, we have created a second program, ImageRail (Supplementary Software 2), for high-content microscopy that (i) segments images of cells grown in 96- and 384-well plates to extract features such as cell shape or nuclear fluorescence, (ii) stores experimental metadata and the results of image analysis in SDCubes, (iii) computes sets of cellular features from the images (e.g., fluorescence and localization metrics), and (iv) displays metadata, images and analysis in various formats9. By using SDCubes, ImageRail organizes data according to the design of an experiment and its day-to-day evolution rather than an inflexible, predetermined schema. We use these tools to characterize the responses of tumor cells to therapeutic small molecules and show that the apparent IC50 for receptor inhibitors varies with ligand dose, that cell-to-cell variability is maximal as ligands and drugs approach concentrations likely to be encountered in vivo, and that this variance affects the shape of dose-response curves. Our results suggest that monitoring variance will be broadly useful in preclinical pharmacology. Moreover, because flow cytometry and multiplex biochemistry have workflows similar to that of imaging4,10, ImageRail and the SDCube Programming Library represent starting points for managing diverse experimental data.
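The core SDCube idea, numerical data in a hierarchical HDF5 file with the experimental design held in a separate machine-readable XML document, can be sketched in a few lines. This is a minimal illustration of the general pattern, not the actual SDCube layout or the API of the SDCube Programming Library; the group names, dataset names and XML tags below are hypothetical:

```python
# Minimal sketch of the HDF5 + XML pattern behind SDCubes.
# All names (plate_0, well_A01, cell_features, the XML tags) are hypothetical.
import xml.etree.ElementTree as ET
import numpy as np
import h5py

# Per-cell features for one well: rows = cells, columns = measured features.
features = np.random.rand(250, 3)

with h5py.File("example_sdcube.h5", "w") as f:
    well = f.create_group("plate_0/well_A01")
    dset = well.create_dataset("cell_features", data=features)
    dset.attrs["feature_names"] = ["area", "nuclear_mean", "cyto_mean"]

# XML metadata mapping the experimental design onto the HDF5 data space.
root = ET.Element("experiment", name="EGF_dose_response")
sample = ET.SubElement(root, "sample", well="A01", ligand="EGF",
                       dose_ng_per_ml="100")
ET.SubElement(sample, "data", path="/plate_0/well_A01/cell_features")
ET.ElementTree(root).write("example_sdcube.xml")
```

Because the design lives in the XML description rather than in a fixed relational schema, adding a time point, condition or feature amounts to adding HDF5 groups and XML elements, rather than migrating a database schema.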
Figure 2 SDCubes are built from a collection of linked data modules that can encode diverse experimental data with varying requirements. (a) The XML metadata maps the experimental sampling procedure onto the HDF5 data space.