HHS Public Access; Author Manuscript; accepted for publication in a peer-reviewed journal.
IEEE NIH Life Sci Syst Appl Workshop. Author manuscript; available in PMC 2010 April 28.
Published in final edited form as:
IEEE NIH Life Sci Syst Appl Workshop. 2009 April 9; 2009: 96–99.
doi:  10.1109/LISSA.2009.4906718
PMCID: PMC2860574

An Image Informatics Method for Automated Quantitative Analysis of Phenotype Visual Similarities


The post-genomic era introduced the need to define single gene functions within biological pathways. A systems biology approach can be realized by automating image acquisition and phenotype classification. While machinery for automated data acquisition has developed rapidly in the past years, the main bottleneck remains the effectiveness of the computer vision algorithms. Here we describe a fully automated process for finding phenotype similarities within a dataset acquired from an RNAi screen. The source code for the algorithms is available for free download.

I. Introduction

In the past few years, pipelines providing high-throughput biological imaging have become increasingly popular, supported by the increasing availability of automated microscopy and high-performance computing. The availability of such instruments enables quantitative measurements of phenotype similarities in very large datasets. Applications include profiling drug responses [9], screening for small molecules [16], and classification of sub-cellular localization [2].

The availability of full genome sequences introduces the opportunity to perform large-scale analysis of gene functionality and reveal relationships between gene function and phenotype traits [11], [10]. However, due to the complex nature of the data, a successful implementation is subject to the limitations of the computer vision algorithms, which remain a large barrier to overcome.

Image analysis based on a specific morphology of the cell such as the size or shape [5], [6] may not provide clear relationships between gene knockout and function due to the variety of the terminal phenotypes expected. Instead, the algorithms should be able to handle phenotype similarities in a more general sense, covering a wide range of phenotype instances.

Here we describe an automated method for mining phenotype similarities. This is a new approach to finding genes with similar functionality, and is different from finding gene similarity by comparing sequences (e.g., BLAST).

II. Methods

For the purpose of measuring phenotype similarities, we used the WND-CHARM image classification algorithm. WND-CHARM [12], [14] makes use of a large set of 1025 image features extracted from each image. Each image feature (or set of image features) can be useful for finding similarities (or differences) between several different types of images.

Image features extracted by WND-CHARM can be divided into four groups: high-contrast features, which include edge and object information; polynomial decompositions, which are statistics based on the polynomial representation of the image; statistics, which include multi-scale histograms and moments; and textures, which include Tamura [15] and Haralick [7] textures. This set of image features is computed by WND-CHARM on the raw pixels, and also on the Fourier, Chebyshev and Wavelet (Symlet 5) transforms of the image and on several compound transforms (Fourier-Wavelet and Fourier-Chebyshev). This variety of image features makes WND-CHARM effective for finding similarities between different phenotypes that are not known prior to the experiment. A more detailed description of WND-CHARM can be found in [12], [14], and the MATLAB and C implementations of the algorithm are available for free download at

Before image features are computed, each image is broken into 16 equal-sized tiles, and image features are computed for each tile separately. Then, the images are split into a training set and test set with two thirds of the images in the training set, and the remaining images used to evaluate the trained classifier.
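The tiling and the split described above can be sketched as follows. This is a minimal illustration, not the WND-CHARM code: the function names are hypothetical, the 16 tiles are assumed to form a 4×4 grid, and the two-thirds split is assumed to be a random split over whole images so that all tiles of an image stay on the same side.

```python
import numpy as np

def split_into_tiles(image, grid=4):
    """Split a 2-D image into grid x grid equal-sized tiles (16 when grid=4).

    The 4x4 layout is an assumption; the text only states 16 equal-sized tiles.
    """
    h, w = image.shape
    if h % grid or w % grid:
        raise ValueError("image dimensions must be divisible by the grid size")
    th, tw = h // grid, w // grid
    # Reshape to (grid, th, grid, tw), then move both grid axes to the front.
    tiles = image.reshape(grid, th, grid, tw).swapaxes(1, 2)
    return tiles.reshape(grid * grid, th, tw)

def train_test_split_images(n_images, train_fraction=2 / 3, seed=0):
    """Return (train_idx, test_idx) over image indices.

    Random shuffling is an assumption; the text only gives the proportions.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    n_train = int(round(n_images * train_fraction))
    return order[:n_train], order[n_train:]
```

With 50 images per gene, as in the experiment reported below, this split would place 33 images in the training set and 17 in the test set.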

Since very many feature values are computed, some features are assumed to represent noise. To increase the signal and remove noise, the features are ranked by their discriminative power using simple Fisher scores [1], and the lowest-scoring 35% of the features are rejected. Once the Fisher scores are computed, the weighted Euclidean distance $d_{t,s}$ between a tile t from the training set and a tile s from the test set is computed by $d_{t,s} = \sum_{f \in F} w_f (t_f - s_f)^2$, where F is the set of 1025 image features, $t_f$ and $s_f$ are the values of image feature f in tiles t and s, respectively, and $w_f$ is the Fisher score of feature f.
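The feature weighting step can be illustrated with the short sketch below. The exact form of the Fisher score used by WND-CHARM is not given here, so the code assumes a common definition (variance of the class means divided by the mean within-class variance); the function names are hypothetical, and rejected features are simply given a weight of zero.

```python
import numpy as np

def fisher_scores(X, y, eps=1e-12):
    """Per-feature Fisher score for samples X (n_samples x n_features), labels y.

    Assumed definition: between-class variance of the class means divided by
    the mean within-class variance. eps avoids division by zero.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    class_means = np.stack([X[y == c].mean(axis=0) for c in classes])
    class_vars = np.stack([X[y == c].var(axis=0) for c in classes])
    return class_means.var(axis=0) / (class_vars.mean(axis=0) + eps)

def reject_weakest(scores, reject_fraction=0.35):
    """Zero the weights of the lowest-scoring fraction of the features."""
    weights = np.asarray(scores, dtype=float).copy()
    n_reject = int(np.floor(reject_fraction * weights.size))
    if n_reject:
        weights[np.argsort(weights)[:n_reject]] = 0.0
    return weights

def weighted_distance(t, s, weights):
    """d_{t,s} = sum over features f of w_f * (t_f - s_f)^2."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    return float(np.sum(np.asarray(weights, dtype=float) * (t - s) ** 2))
```

Setting the weight of a rejected feature to zero is equivalent to removing it from the sum, which keeps the feature vectors at a fixed length.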

Phenotype similarities are determined based on how images in the test set are classified using the images in the training set. When an image from the test set is classified, each of its 16 tiles is assigned a similarity value to each of the genes in the training set. This is performed using a Weighted Nearest Neighbor rule [3], such that the similarity value $t_g$ of tile t to gene g is $t_g = \frac{1/d_g}{\sum_{i \in G} 1/d_i}$, where $d_g$ is the distance from tile t to its closest tile of gene g in the training set, and G is the set of all genes participating in the experiment. The similarity values of an image to each of the genes are computed simply by summing the 16 similarity vectors of its 16 tiles. The sum improves the signal-to-noise ratio of the image similarity compared to the similarity vector computed from a single tile.
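A minimal sketch of the tile and image similarity computation is given below, assuming that feature vectors and per-feature weights are already available. The function names and the small constant `eps` (which guards against a zero distance) are illustrative assumptions and are not part of the published method.

```python
import numpy as np

def tile_similarities(test_tile, train_tiles, train_genes, weights, eps=1e-12):
    """Similarity of one test tile to every gene, by the weighted NN rule.

    For each gene g, d_g is the weighted distance to the closest training
    tile of that gene; the similarity is (1/d_g) / sum_i (1/d_i).
    """
    train_tiles = np.asarray(train_tiles, dtype=float)
    train_genes = np.asarray(train_genes)
    weights = np.asarray(weights, dtype=float)
    diff = train_tiles - np.asarray(test_tile, dtype=float)
    d = np.sum(weights * diff ** 2, axis=1)
    genes = np.unique(train_genes)
    d_min = np.array([d[train_genes == g].min() for g in genes])
    inv = 1.0 / (d_min + eps)
    return genes, inv / inv.sum()

def image_similarities(test_tiles, train_tiles, train_genes, weights):
    """Sum the similarity vectors of all tiles of one test image."""
    genes = np.unique(np.asarray(train_genes))
    total = np.zeros(len(genes))
    for tile in test_tiles:
        _, sim = tile_similarities(tile, train_tiles, train_genes, weights)
        total += sim
    return genes, total
```

Each tile vector sums to 1, so the summed vector of a 16-tile image sums to 16; dividing by the number of tiles would give a per-image average without changing the ranking of the genes.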

The resulting phenotype similarity values are computed by averaging the similarity values of all images for each of the genes participating in the experiment. That is, the similarity of the phenotype produced by knocking down gene g1 to the phenotype produced by knocking down gene g2 is the average similarity value of all images of gene g1 to images of gene g2. This results in a matrix of similarity values between all pairs of genes. This similarity matrix can be used for finding similar phenotypes that were produced by knocking down different genes, and those findings may be used to reconstruct biological pathways.
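The averaging step can be sketched as follows, assuming each test image already has a similarity vector over the genes. The function name and the array layout (one row per test image, one column per gene) are assumptions made for illustration.

```python
import numpy as np

def gene_similarity_matrix(image_sims, image_genes, genes):
    """Average per-image similarity vectors into a gene-by-gene matrix.

    image_sims:  (n_images x n_genes) similarity of each test image to each gene
    image_genes: gene knocked down in each test image
    genes:       gene order for both the rows and the columns of the result
    Entry [i, j] is the mean similarity of the images of genes[i] to genes[j].
    """
    image_sims = np.asarray(image_sims, dtype=float)
    image_genes = np.asarray(image_genes)
    return np.stack([image_sims[image_genes == g].mean(axis=0) for g in genes])
```

For example, two images of gene "a" with similarity vectors (0.8, 0.2) and (0.6, 0.4) would give the row (0.7, 0.3) for "a".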

Manually inspecting the similarity matrix and searching for high similarity values can become an exhausting task, especially when very many genes are involved. In order to make this task more convenient, we visualize the phenotype similarities using phylogenies (evolutionary trees) inferred automatically by the PHYLIP package [4]. The phylogenies provide a tree of phenotypes in which the lengths of the edges reflect how similar the phenotypes are, based on the values taken from the similarity matrix.

Deducing image similarities is considered a complicated task for computer programs due to the complex nature of the data, and therefore the presence of noise in the similarity values is unavoidable. Due to the noise generated by the image classifier, some of the phenotype similarities might not be symmetric. That is, the similarity of g1 to g2 may be 0.9, while the similarity of g2 to g1 is 0.86. Since the distances in the phylogeny are undirected, we simply average the two values to obtain one distance value between the two genes.
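The symmetrization, and one possible way of preparing the values for a distance-based tree program, can be sketched as below. How similarities are mapped to distances is not specified here, so the code assumes distance = 1 − similarity for values normalized to a self-similarity of 1; the function names are hypothetical, and the output follows the commonly used PHYLIP square distance-matrix layout (taxon count on the first line, then names padded to 10 characters followed by the distances).

```python
import numpy as np

def symmetrize(sim):
    """Average sim[i, j] and sim[j, i] to obtain one undirected value per pair."""
    sim = np.asarray(sim, dtype=float)
    return (sim + sim.T) / 2.0

def similarity_to_distance(sim):
    """Assumed conversion: distance = 1 - similarity, clipped at zero."""
    return np.clip(1.0 - np.asarray(sim, dtype=float), 0.0, None)

def to_phylip(dist, names):
    """Format a square distance matrix as PHYLIP-style text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)  # names occupy the first 10 characters
        lines.append(label + " ".join(f"{v:.4f}" for v in row))
    return "\n".join(lines) + "\n"
```

With the example values from the text, similarities of 0.9 and 0.86 are averaged to 0.88, which corresponds to a distance of 0.12 under the assumed conversion.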

III. Results

To assess the efficacy of the image analysis, we utilized a small dsRNA library (Open Biosystems) to cause single gene knockdown in cultured Drosophila cells. The library included 14 genes as listed in Table I, and can be divided into five expected phenotypic classes, which are apoptotic (dIAP1), G1 arrest (pavarotti, CyclinE, MCM2, Rad17), G1 delay (MAPk-AK2), DNA damage (p38-MPK2, FANC-M, Cul-4) and one unknown phenotype (CHD3).

Table I
Genes included in the tested dsRNA library

Each gene had 50 experiments done on the same slide. After fixation, cells were stained with DAPI, washed and mounted, and then deconvolved 1024×1024 images (one per experiment) were acquired using a Deltavision (Applied Precision, Inc., Issaquah, WA) microscope setup. Sample images are shown in Figure 1.

Fig. 1
Sample images of genes Pavarotti (a), CyclinE (b), p38-MPK2 (c) and untreated cells (d)

After the images were acquired, the WND-CHARM image classifier provided the phenotype similarity matrix shown in Table II. The values for each gene are normalized such that the similarity of each gene to itself is 1.
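One simple way to obtain such a normalization is to divide each row of the similarity matrix by its diagonal entry, as sketched below. This is an assumption made for illustration (the exact normalization is not detailed here), and the function name is hypothetical.

```python
import numpy as np

def normalize_by_self_similarity(sim):
    """Scale each row so that the similarity of every gene to itself is 1.

    Assumed scheme: row i is divided by sim[i, i].
    """
    sim = np.asarray(sim, dtype=float)
    diag = np.diag(sim)
    if np.any(diag <= 0):
        raise ValueError("self-similarity values must be positive")
    return sim / diag[:, None]
```

Because each row is scaled by its own diagonal value, the normalized matrix is generally not symmetric even if the input was.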

Table II
Phenotype similarity values computed by WND-CHARM

Figure 2 shows the corresponding phylogeny that visualizes the similarity values of Table II. As can be seen in Figure 2, the proposed method detected very similar phenotypes for genes 11 and 12. This observation is backed up by a clear link to previously reported studies, indicating that gene 11 (Dmp53) is a substrate for gene 12 (Loki).

Fig. 2
The phylogeny of phenotype similarities generated from the similarity values of Table II.

Untreated cells (14) were not found similar to any of the other phenotypes, and neither were genes 2 and 9, which do not have similar phenotypes in the tested group of genes. Gene 13 (dIAP1) causes cell death, and was also found by the proposed method not to share similarities with any of the other tested genes. Gene 3 (CHD1) does not have a well-defined phenotype reported in the literature, and does not appear to be associated with any of the tested genes.

IV. Conclusions

Mining for gene similarities has been attracting considerable attention in the field of bioinformatics. Due to difficulties in processing and comparing large sets of different phenotypes, most attempts to find genes with similar functionality are based on sequence analysis methods (e.g., BLAST). These methods rely heavily on the contention that genes with similar functionality should also have similar sequences, generating similar proteins. However, in many cases genes with different sequences can be part of the same biological mechanisms [8], [13].

Here we described an automated process for mining phenotype similarities. This is a new approach to finding genes with similar functionality, and is different from finding gene similarity by comparing sequences.

Clearly, the proposed method can only sense phenotypic features that are visible using a microscope, and due to the complex nature of the quantification of phenotype morphology it is not expected to detect all genes with similar functionality. However, given that very many of the genes in any organism are not mapped to any known function, applying this method to large sets of phenotypes with single gene knockdown can potentially reconstruct biological pathways.


This research was supported entirely by the Intramural Research Program of the NIH, National Institute on Aging.


1. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006.
2. Boland MV, Murphy RF. A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics. 2001;17:1213–1223.
3. Brown T, Koplowitz J. The weighted nearest neighbor rule for class dependent sample sizes. IEEE Trans on Information Theory. 1979;25:617–619.
4. Felsenstein J. PHYLIP Phylogeny Inference Package, Version 3.6. 2004.
5. Fraser AG, et al. Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature. 2000;408:325–330.
6. Giaever G, et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002;418:387–391.
7. Haralick RM, Shanmugam K, Dinstein I. Textural Features for Image Classification. IEEE Trans on Systems, Man, and Cybernetics. 1973;6:269–285.
8. Lettre G, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008;40:584–591.
9. Loo L, Wu LF, Altschuler SJ. Image-based multivariate profiling of drug responses from single cells. Nature Methods. 2007;4:445–453.
10. Pepperkok R, Ellenberg J. High-throughput fluorescence microscopy for systems biology. Nature Reviews Molecular Cell Biology. 2006;7:690–696.
11. Ohya Y, et al. High-dimensional and large-scale phenotyping of yeast mutants. Proc Natl Acad Sci. 2005;102:19015–19020.
12. Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg I. WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern Recognition Letters. 2008;29:1684–1693.
13. Sanna S, et al. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet. 2008;40:198–203.
14. Shamir L, Orlov N, Macura T, Eckley DM, Johnston J, Goldberg IG. Wndchrm - An Open Source Utility for Biological Image Analysis. BMC Source Code for Biology and Medicine. 2008;3:13.
15. Tamura H, Mori S, Yamawaki T. Textural features corresponding to visual perception. IEEE Trans on Syst Man and Cyber. 1978;8:460–472.
16. Tanaka M, et al. An unbiased cell morphology-based screen for new, biologically active small molecules. PLoS Biol. 2005;3:e128.