In recent years, systems for performing high-throughput microscopy-based experiments have become available and are often used to test the effects of chemical or genetic perturbations on cells46,47, to determine the subcellular localization of proteins48,49 or to study gene expression patterns in development50. These screens produce huge amounts of image data (sometimes tens of terabytes and millions of images) that must be managed, quality controlled, browsed, annotated and interpreted. As a consequence, tools for visualization and analysis are key at virtually all levels of such projects (see ).
Figure 4 Visualization of high-throughput data. (a) By analogy with the 'eisengram' for microarray data, discrete spatial gene expression annotation data (left) can be summarized by a so-called 'anatogram' (right).
Some large-scale experiments involving particularly complex read-outs have been annotated manually using controlled vocabularies51, in some cases (as for high-content analysis of gene expression during development) eased by the use of custom-built annotation tools34,52,53. Several visualization aids have been developed to succinctly summarize the complex, multidimensional annotations and organize them using clustering methods borrowed from the microarray data analysis field. The main challenge of representing qualitative rather than quantitative annotation data was addressed by introducing discrete color coding for the controlled vocabulary terms collapsed to the most informative level in the annotation ontology. In that way, the analog microarray 'eisengram' evolved into the digital 'anatogram' capable of visually summarizing the gene expression properties of arbitrary groups of genes (). Once large, expert-annotated image sets became available, computational approaches were successfully used to automate the annotation process (for example, automatic annotation of subcellular protein localization54 and gene expression patterns55,56). In most cases, large image datasets are automatically processed to extract a wide range of attributes from the images.
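The collapse of free annotation terms to a fixed, most-informative ontology level, followed by discrete color coding, can be sketched in a few lines. The tiny ontology, term names and color assignments below are hypothetical, purely to illustrate the idea of an 'anatogram'-style row:

```python
# Sketch: collapse controlled-vocabulary annotations to a summary level
# and assign discrete color codes, anatogram-style. The ontology, term
# names and color codes here are made up for illustration.

PARENT = {  # child term -> most informative summary term
    "nuclear speckles": "nucleus",
    "nucleolus": "nucleus",
    "plasma membrane": "membrane",
    "mitochondrial matrix": "mitochondrion",
}

COLOR_CODE = {"nucleus": 0, "membrane": 1, "mitochondrion": 2}

def anatogram_row(annotations):
    """Collapse each annotation to its summary term and return one
    discrete color code per distinct term (order-preserving)."""
    codes, seen = [], set()
    for term in annotations:
        summary = PARENT.get(term, term)
        if summary not in seen:
            seen.add(summary)
            codes.append(COLOR_CODE[summary])
    return codes

# One anatogram row: a gene annotated with three subcellular terms
print(anatogram_row(["nucleolus", "nuclear speckles", "plasma membrane"]))
# -> [0, 1]
```

Because the codes are categorical rather than continuous, a discrete palette (one color per summary term) replaces the red-green gradient of the microarray heat map.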
To navigate efficiently through this sea of data, users need visualization software that can display informative summaries at different levels: in the acquisition and quality-control phase, multiwell plate and similar visualizations () that show image-based data values with a link to the raw data, or thumbnails of images that can be enlarged for careful examination, are very helpful. Tools are also needed to show relevant image-based data to biologists in an intuitive manner, enabling them to identify meaningful characteristics, explore potential correlations and relationships in the data, and home in on the most interesting samples in their experiment. For this exploratory data analysis, data-enhanced scatter and density plots () and histograms of image-derived data can be used, in which the user can select subsets of data, view examples of the raw images producing those data points and filter data points for further analysis. Browsing these graphical representations linked to the raw data allows biologists to identify interesting subsets (for example, the morphological classes present in a cell-based screen, or training sets for subsequent supervised machine learning) in an interactive and intuitive manner. Linking to the original data is particularly important: first, because users must frequently locate relevant images to manually confirm the automated, quantitative results and, second, because there is often no obvious a priori link between quantitative image descriptors and biological meaning. Further analysis of these attributes and of the identified subsets eventually leads to image annotations (for example, phenotypes) and/or classifications (for example, as a 'hit'), often by means of supervised learning methods.
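The plate-level quality-control step described above can be illustrated with a minimal sketch: well measurements are arranged on the 8 x 12 grid of a 96-well plate, and wells deviating strongly from the plate median are flagged as candidates for raw-image review. The fold-change threshold and the well values are illustrative assumptions, not a prescribed QC rule:

```python
# Sketch: a minimal quality-control view of a 96-well plate. Values are
# arranged on an 8x12 grid and wells far from the plate median are
# flagged for inspection of the underlying images. The 2-fold threshold
# is an illustrative choice.

ROWS, COLS = "ABCDEFGH", range(1, 13)

def plate_grid(measurements):
    """measurements: dict like {'A1': 0.93, ...} -> nested [row][col] list,
    with None for wells that were not measured."""
    return [[measurements.get(f"{r}{c}") for c in COLS] for r in ROWS]

def flag_outliers(measurements, fold=2.0):
    """Return well IDs whose value differs from the plate median by more
    than the given fold change (candidates for raw-image review)."""
    values = sorted(measurements.values())
    median = values[len(values) // 2]
    return sorted(w for w, v in measurements.items()
                  if v > median * fold or v < median / fold)

wells = {f"{r}{c}": 1.0 for r in ROWS for c in COLS}
wells["C7"] = 3.1  # hypothetical outlier well
print(flag_outliers(wells))  # -> ['C7']
```

In an interactive tool, each flagged well would link back to its thumbnail and full-resolution images, which is exactly the raw-data linkage the text argues for.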
When putting the experiment into the context of existing biological knowledge, researchers are concerned about how images and their derived data relate to known biological entities. For example, one may want to browse all images related to a given gene, gene ontology term or chemical treatment. This requires integration with other sources of information, usually external databases. The visualization methods suited for this are also commonly used in systems biology: heat maps and projections in two-dimensional maps57.

However, many of these goals remain unaddressed by existing software tools. Gracefully and intuitively presenting rich image data representing possibly hundreds of attributes extracted from billions of cells is a demanding task for a visual analysis tool. Still, some recent developments have begun to ease aspects of these visualization challenges for high-throughput experiments. Several software tools offered by screening-oriented microscope companies enable certain data visualizations (), as does third-party software such as Cellenger and the open-source CellProfiler project58,59 (). These packages integrate image processing algorithms with statistical analysis and graphical representations of the data and also offer machine learning methods that capitalize on the multiple attributes measured in the images. In workflow management software (for example, HCDC; ), where modules communicate through defined inputs and outputs, user-defined visualization modules can be integrated into a data acquisition and processing workflow; this increase in flexibility and history tracking typically comes with a loss in user interactivity and browsing capabilities.
A selective list of high-throughput visualization tools
Although presentation, representation and querying of primary visual and quantitative data are a significant problem, an associated difficulty is that the dimensionality of the data derived from or associated with each image or object is growing rapidly. The challenge is to visualize such high-dimensional data concisely so that it can be explored to identify patterns and trends at the image level. A common strategy linearly projects high-dimensional data into low dimensions for visualization using various forms of multidimensional scaling60 (for example, principal component analysis or Sammon mapping61). Multidimensional scaling aims to map high-dimensional vectors into low dimensions in such a way as to preserve some measure of distance between the vectors. Once such an embedding or mapping into two or three dimensions has been accomplished, the data can be visualized and any relationships observed. One approach to visualizing and interacting with high-dimensional data and microscopy imaging is the iCluster system62, developed in association with the Visible Cell63. Here, large image sets from single or multiple fluorescence microscopy experiments may be visualized in three dimensions (). Spatial placement in three dimensions can be generated automatically by Sammon mapping of high-dimensional texture measures or through user-supplied statistics associated with each image. Thus, sets of images that are statistically and visually similar are presented as spatially proximate, whereas dissimilar images are distant. This allows outliers and unusual images to be detected easily, while differences between classes (for example, treatment versus control) or among multiple classes within an experiment appear as spatial separation. Visualization of relationships and correlations among the data allows the user to find and define the unusual, the representative and broad patterns in the data.
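The distance-preservation objective behind Sammon mapping can be made concrete by writing down its stress function, which scores how faithfully a low-dimensional layout reproduces the pairwise distances between per-image feature vectors. The feature values below are invented for illustration; only the stress formula itself follows the standard definition:

```python
# Sketch: the Sammon stress used to judge how well a 2-D layout preserves
# pairwise distances between high-dimensional per-image feature vectors.
# Feature values are made up for illustration.
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sammon_stress(high, low):
    """Stress of layout `low` for high-D points `high`:
    sum over pairs of (d_high - d_low)^2 / d_high, normalized by the
    sum of the high-dimensional distances. Zero means perfectly preserved."""
    num = denom = 0.0
    n = len(high)
    for i in range(n):
        for j in range(i + 1, n):
            dh, dl = dist(high[i], high[j]), dist(low[i], low[j])
            num += (dh - dl) ** 2 / dh
            denom += dh
    return num / denom

# Three images described by four texture features; a 2-D layout that
# reproduces all pairwise distances has zero stress.
high = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0)]
low = [(0, 0), (1, 0), (0, 1)]
print(round(sammon_stress(high, low), 6))  # -> 0.0
```

A full Sammon mapping iteratively moves the low-dimensional points (typically by gradient descent) to minimize this stress; the 1/d_high weighting is what makes it emphasize small distances, i.e., local structure, more than classical multidimensional scaling does.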
Most of the above visualization schemes apply to cellular-level measurements of populations of cells, but none of these methods takes time-resolved data into consideration. Although the temporal evolution of one or several cellular or population features from a single experiment can easily be plotted over time, this approach is impractical when relationships between hundreds or thousands of experiments must be visualized. In this case, the time series can be ordered according to some similarity criterion and visualized as a color-coded matrix (). Similarly, heat maps can be extended to represent multidimensional time series (); the time series corresponding to different dimensions can be concatenated. Here, the most difficult part is to define an appropriate distance function for multidimensional time series according to whether absolute or relative temporal information is important64. Often, the time itself is less informative than the relative order in which events occur. In this case, it is also possible to estimate a representative order of events from the time-lapse experiments (for example, phenotypic events at the single-cell level). This event order can be used for characterizing, grouping and visualizing experimental conditions, creating an event order map ()64.
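One simple way to estimate such a representative event order is to rank the events by onset time within each experiment and then sort events by their mean rank across experiments. The event names and onset times below are hypothetical, and this consensus-by-mean-rank scheme is only one possible realization of the idea:

```python
# Sketch: estimating a representative order of phenotypic events from
# several time-lapse experiments, as in an event-order map. Event names
# and onset times are hypothetical; mean rank is one possible consensus.

def event_order(experiments):
    """experiments: list of {event: onset_time} dicts. Events are ranked
    by onset within each experiment; the consensus order sorts events by
    their mean rank across all experiments."""
    ranks = {}
    for exp in experiments:
        ordered = sorted(exp.items(), key=lambda kv: kv[1])
        for rank, (event, _) in enumerate(ordered):
            ranks.setdefault(event, []).append(rank)
    return sorted(ranks, key=lambda e: sum(ranks[e]) / len(ranks[e]))

exps = [
    {"prophase": 2, "metaphase": 10, "anaphase": 14},
    {"prophase": 3, "metaphase": 8, "anaphase": 20},
    {"prophase": 1, "metaphase": 5, "anaphase": 9},
]
print(event_order(exps))  # -> ['prophase', 'metaphase', 'anaphase']
```

Note that only the ranks enter the consensus, not the absolute times, which matches the point that the relative order of events is often more informative than the times themselves.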