Bioimage informatics has been established as a new branch of bioinformatics research over the last 10 years. The term bioimage comprises all kinds of images generated from biological samples in a biological or biomedical research context, using a large diversity of imaging techniques. These techniques range from standard ones such as bright field imaging or phase contrast to advanced technologies that record many molecular variables for each resolvable volume unit. Images produced by the latter group of technologies can be referred to as multivariate bioimages (MBIs; Herold et al., 2011). MBIs belong to the so-called high-content imaging techniques, which apply high-resolution imaging in time and/or space and/or variables to close those gaps in systems biology that cannot be bridged by standard, i.e. non-spatial, omics techniques (Megason and Fraser, 2007; Starkuviene and Pepperkok, 2007). While omics techniques can in principle resolve the almost complete molecular composition of a sample on different levels (genomics, transcriptomics, proteomics and metabolomics), they have to leave out the spatial domain. In contrast, bioimaging approaches, which usually work with a lower level of molecular resolution, can relate molecular information to spatial features such as morphology.
Typical examples of MBI techniques are Matrix-Assisted Laser Desorption/Ionization (MALDI) imaging (Cornett et al., 2007), vibrational spectroscopy/Raman microscopy (van Manen et al., 2005) and MultiEpitope-Ligand Cartography (MELC)/Toponome Imaging System (TIS) (Schubert et al., 2006). The first two techniques measure molecular features and interactions in localized spectra, arranged in a pixel grid. The interpretation of the obtained images aims at identifying pixel groups that share particular or similar spectral features (e.g. Alexandrov et al., 2010), whereas the final identification of molecules and a semantic interpretation remain unsolved problems for most applications. In contrast, MELC/TIS (for the sake of compactness, we will refer to this technique as TIS) aims at imaging a selected set of N proteins using a library of N fluorescently labeled antibodies, lectins or other specific ligands (referred to as tags, in general) in combination with a cyclic protocol of staining, fluorescence imaging and soft bleaching. To unfold the full potential of all these kinds of MBI, new algorithms and software are needed that allow researchers to visually explore the data and to identify hidden regularities. In this article, we focus on images recorded using the TIS technology; however, our method is also applicable to MBI data recorded with a different multitag technology, or to MALDI images.
For one selected field of view (FOV) in the sample, TIS records one multivariate image $T^{(s)}$, which consists of a set of N aligned images $g_a^{(s)}(x, y)$, $a = 1, \dots, N$ (with x, y as pixel coordinates), where s ($s = 1, \dots, S$) is the ID of the TIS image/FOV and $g_a^{(s)}$ denotes the fluorescence gray value image for tag a. In practice, S TIS runs with one library of N > 10 tags are applied to record a set of S datasets. With $g_{x,y} = (g_1, g_2, \dots, g_N)_{x,y}$ we refer to the N gray values for the respective N tags assigned to one pixel (x, y) in a TIS image $T^{(s)}$. To align the N fluorescence images in one TIS image, phase contrast images are recorded in each cycle and used as a reference.
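The data structure described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' implementation: the array layout (tags first, then rows and columns), the image size, the random gray values and the helper names `pixel_vector` and `as_feature_matrix` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N, H, W = 12, 4, 5  # N tags, image height, image width

rng = np.random.default_rng(0)
# One TIS image T: N aligned gray value images stored as an (N, H, W) array.
# Random values stand in for real fluorescence measurements.
tis_image = rng.integers(0, 256, size=(N, H, W), dtype=np.uint8)


def pixel_vector(tis, x, y):
    """Return the N gray values (g_1, ..., g_N) assigned to pixel (x, y)."""
    return tis[:, y, x]


def as_feature_matrix(tis):
    """Flatten a TIS image into one N-dimensional row per pixel (row-major)."""
    n_tags = tis.shape[0]
    return tis.reshape(n_tags, -1).T


g = pixel_vector(tis_image, x=2, y=1)
features = as_feature_matrix(tis_image)
print(g.shape, features.shape)
```

Flattening to one row per pixel is a common way to hand such data to clustering or embedding methods, since each pixel then becomes a point in an N-dimensional feature space.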
One TIS image or a set of TIS images represents a high-dimensional, complex data structure that encodes hidden relationships between the colocation of proteins and their spatial distribution pattern, which is also referred to as the toponome (Schubert et al., 2006). On the one hand, the gain in molecular information through toponome data has the potential to lead to a new understanding of functional molecular networks; on the other hand, the analysis of TIS data represents a new and challenging problem with a large number of open issues for bioimage informatics. Visual inspection of each of the N single gray value images can hardly reveal colocation of proteins. Likewise, iteratively superimposing three out of the N images, or even all images, to obtain RGB fusion images is not feasible for protein network identification, since an observer would need to analyze $N!/(3!\,(N-3)!)$ visualizations and link the results obtained for each image triplet, which is impossible for human observers.
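To give a feeling for how quickly the number of three-image overlays grows, the following short Python sketch counts the ways of choosing three out of N tag images. It assumes that the count meant here is the binomial coefficient "N choose 3" and that the assignment of the three images to the R, G and B channels is ignored; the function name is hypothetical.

```python
from math import comb


def num_rgb_triplets(n_tags):
    """Number of unordered triplets of tag images that could be fused into RGB."""
    return comb(n_tags, 3)


for n in (10, 20, 50):
    print(n, num_rgb_triplets(n))
# 10 120
# 20 1140
# 50 19600
```

Even for a moderate library of 20 tags, more than a thousand overlays would have to be inspected and mentally linked for a single field of view.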
One straightforward way to reduce the complexity of the data is to apply a threshold to each image. Schubert et al. (2006) applied such a method for pixel-wise extraction of binary colocation and anti-colocation vectors, termed combinatorial molecular phenotypes (CMPs), by manually thresholding each image $g_a^{(s)}$ for a combinatorial analysis. Random colors are subsequently assigned to each of the n detected CMPs to construct so-called toponome maps, which encode the spatial location of each CMP with its individual color. Although the concept of binary CMPs has the advantage of a fundamental reduction of data complexity and a clear interpretation on the level of a single CMP, manually thresholding each image has several disadvantages. It is time consuming and requires a high level of expertise to set reasonable thresholds. Slight modifications of a threshold can lead to different CMP lists, potentially affecting the interpretation of the data. Furthermore, thresholding discards information inherent in the data, so analyzing non-binarized gray value images may be better suited to tracking protein locations in the cell (Friedenberger et al., 2007). Nevertheless, the CMP concept has been applied successfully in several studies (Bhattacharya et al., 2010; Bonnekoh et al., 2006; Eyerich et al., 2009; Ruetze et al., 2010), for example to reveal proteins controlling the molecular networks of tumor cell lines, or to find CMPs that distinguish between healthy subjects, patients with psoriasis and patients with atopic dermatitis.
Even regardless of the thresholding issue, we believe that the CMP-based visualization concept should be reconsidered for the following reasons. From a visualization point of view, mapping CMPs to random colors treats CMPs as nominal variables. On the one hand, this perspective on a colocation pattern is well motivated, since similar patterns (CMPs) can constitute different functions (similarity may be quantified using the Hamming distance for binary patterns). On the other hand, similar patterns may also belong to the same functional group or to the same hierarchically organized network. Another drawback of random colors is that the morphological structure in a random color map can be hard to interpret, since the colorful map can overburden the cognitive capacity of a user. An alternative visualization concept is therefore needed that maps similar patterns to similar colors. In other words, one needs a pseudocoloring that preserves the topology of the N-dimensional fluorescence colocation feature space.
In summary, a new method for visual data mining of TIS images is needed that features the following. First, it has to provide an overview of the entire image using a pseudocolor visualization. Second, it has to support the identification and display of relevant gray value-based protein colocation patterns, referred to as molecular co-expression phenotypes (MCEPs). Third, it must allow similarities and contrasts in the expressed MCEPs to be perceived. Fourth, filtering and zooming must be supported in both domains, tissue morphology and protein colocation.
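The threshold-based CMP extraction, the random-color toponome map and the Hamming distance discussed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Schubert et al. (2006): thresholds are passed in as plain numbers rather than set interactively by an expert, a pixel counts as positive when its gray value exceeds the threshold, and the function names and the tiny 2 x 2 example image are hypothetical.

```python
import numpy as np


def extract_cmps(tis, thresholds):
    """Binarize each tag image with its own threshold and list the distinct CMPs.

    tis: (N, H, W) array of gray values; thresholds: one value per tag.
    Returns (cmps, label_map): cmps is an (n, N) binary array of the n distinct
    patterns, label_map is an (H, W) array with the CMP index of each pixel.
    """
    n_tags, h, w = tis.shape
    thr = np.asarray(thresholds).reshape(n_tags, 1, 1)
    binary = (tis > thr).astype(np.uint8)      # assumed rule: value > threshold
    vectors = binary.reshape(n_tags, -1).T     # one binary vector per pixel
    cmps, inverse = np.unique(vectors, axis=0, return_inverse=True)
    return cmps, inverse.reshape(h, w)


def hamming(a, b):
    """Number of tags in which two binary CMPs differ."""
    return int(np.count_nonzero(a != b))


def toponome_map(label_map, n_cmps, seed=0):
    """Assign one random RGB color per CMP (CMPs treated as nominal labels)."""
    rng = np.random.default_rng(seed)
    colors = rng.integers(0, 256, size=(n_cmps, 3), dtype=np.uint8)
    return colors[label_map]


# Tiny hypothetical example: N = 3 tags, 2 x 2 pixels.
tis = np.array([[[10, 200], [10, 200]],
                [[200, 200], [10, 10]],
                [[10, 10], [10, 200]]])
cmps, label_map = extract_cmps(tis, thresholds=[100, 100, 100])
rgb = toponome_map(label_map, len(cmps))
print(cmps)
print(label_map)
print(hamming(cmps[2], cmps[3]))
```

Because the colors are drawn at random, two CMPs with a Hamming distance of 1 are no more likely to receive similar colors than two CMPs that differ in every tag, which is the limitation that motivates a topology-preserving pseudocoloring.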
In this article, we present the visual data mining tool WHIDE (Web-based Hyperbolic Image Data Explorer), which offers the four functions listed above. The idea behind WHIDE is to identify MCEPs in TIS images using a special variant of the self-organizing map (SOM), the hierarchical hyperbolic self-organizing map (H²SOM), in combination with state-of-the-art internet browser technology and information visualization concepts. Compared with standard SOMs, hyperbolic SOMs have the potential to achieve much better low-dimensional embeddings, since they offer more space: in a hyperbolic plane, the area of a circle grows asymptotically exponentially with its radius (see Supplementary Material for details). This property has also been identified by other researchers as a solution to the so-called focus and context problem in information visualization (Ware, 2004), for example in the well-known hyperbolic tree browser (Lamping et al., 1995). The tool is integrated into our fully web-based online bioimage analysis platform BioIMAX (BioImage Mining, Analysis and eXploration; Loyek et al., 2011), which uses state-of-the-art web graphics toolkits to realize an online bioimage analysis workbench as a Rich Internet Application (RIA) (see access details given above and details given in the Supplementary Material