The post-genomic era introduced the need to define the functions of single genes within biological pathways. A systems biology approach can be realized by automating image acquisition and phenotype classification. While machinery for automated data acquisition has developed rapidly in recent years, the main bottleneck remains the effectiveness of the computer vision algorithms. Here we describe a fully automated process for finding phenotype similarities within a dataset acquired from an RNAi screen. The source code for the algorithms is available for free download.
In the past few years, pipelines providing high-throughput biological imaging have become increasingly popular, supported by the growing availability of automated microscopy and high-performance computing. Such instruments enable quantitative measurements of phenotype similarities in very large datasets. Applications include profiling drug responses, screening for small molecules, and classification of sub-cellular localization.
The availability of full genome sequences introduces the opportunity to perform large-scale analysis of gene functionality and reveal relationships between gene function and phenotype traits. However, due to the complex nature of the data, a successful implementation is subject to the limitations of the computer vision algorithms, which remain a large barrier to overcome.
Image analysis based on a specific morphology of the cell, such as its size or shape, may not provide clear relationships between gene knockout and function due to the variety of the terminal phenotypes expected. Instead, the algorithms should be able to handle phenotype similarities in a more general sense, covering a wide range of phenotype instances.
Here we describe an automated method for mining phenotype similarities. This is a new approach to finding genes with similar functionality, and is different from finding gene similarity by comparing sequences (e.g., BLAST).
For the purpose of measuring phenotype similarities, we used the WND-CHARM image classification algorithm. WND-CHARM makes use of a large set of 1025 image features extracted from each image. Each image feature (or set of image features) may be useful for finding similarities (or differences) between several different types of images.
Image features extracted by WND-CHARM can be divided into four groups: high-contrast features, which include edge and object information; polynomial decomposition features, which are statistics based on the polynomial representation of the image; statistics, which include multi-scale histograms and moments; and textures, which include Tamura and Haralick textures. This set of image features is computed by WND-CHARM on the raw pixels, but also on the Fourier, Chebyshev and Wavelet (Symlet 5) transforms of the image, and on several compound transforms (Fourier-Wavelet and Fourier-Chebyshev). This variety of image features makes WND-CHARM effective for finding similarities between different phenotypes that are not known prior to the experiment. A more detailed description of WND-CHARM is available in the literature, and the MATLAB and C implementations of the algorithm are available for free download at http://www.openmicroscopy.org.
Before image features are computed, each image is broken into 16 equal-sized tiles, and image features are computed for each tile separately. Then, the images are split into a training set and test set with two thirds of the images in the training set, and the remaining images used to evaluate the trained classifier.
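The tiling and splitting step can be sketched as follows. This is a minimal illustration and not the WND-CHARM implementation: the function names `split_into_tiles` and `train_test_split`, the 4×4 grid layout of the 16 tiles, and the random shuffling with a fixed seed are assumptions made for the example.

```python
import random

def split_into_tiles(image, grid=4):
    """Split a 2D image (a list of pixel rows) into grid x grid equal-sized tiles.

    A 4 x 4 grid gives the 16 tiles described in the text (assumed layout).
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image dimensions must be divisible by the grid size")
    tile_h, tile_w = height // grid, width // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * tile_h:(r + 1) * tile_h]
            tiles.append([row[c * tile_w:(c + 1) * tile_w] for row in rows])
    return tiles

def train_test_split(items, train_fraction=2 / 3, seed=0):
    """Shuffle the items and keep about two thirds of them for training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # hypothetical choice: random assignment
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy 8 x 8 "image" whose pixel value encodes its position.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
tiles = split_into_tiles(image)

# 50 image indices, matching the 50 experiments per gene.
train, test = train_test_split(range(50))
```

With 50 images per gene, this split places 33 images in the training set and 17 in the test set.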
Because a very large number of feature values are computed, some features are expected to represent noise. To increase the signal and reduce the noise, the features are ranked by their discriminative power using simple Fisher scores, and the lowest-scoring 35% of the features are rejected. Once the Fisher scores are computed, the weighted Euclidean distance $d_{t,s}$ between a tile t from the training set and a tile s from the test set is computed by $d_{t,s} = \sum_{f \in F} w_f (t_f - s_f)^2$, where F is the set of 1025 image features, $t_f$ and $s_f$ are the values of image feature f in tiles t and s, respectively, and $w_f$ is the Fisher score of feature f.
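The feature weighting and the distance computation can be illustrated with a short sketch. The text does not spell out the Fisher score formula, so the version below (variance of the class means divided by the mean within-class variance) is an assumption, as is the choice to reject features by setting their weight to zero. The names `fisher_scores`, `reject_weakest` and `weighted_distance` are hypothetical.

```python
from statistics import mean, pvariance

def fisher_scores(samples_by_class):
    """samples_by_class maps a class name to a list of feature vectors.

    Assumed score: variance of class means / mean within-class variance.
    """
    groups = list(samples_by_class.values())
    n_features = len(groups[0][0])
    scores = []
    for f in range(n_features):
        per_class = [[vector[f] for vector in group] for group in groups]
        between = pvariance([mean(values) for values in per_class])
        within = mean([pvariance(values) for values in per_class])
        scores.append(between / within if within > 0 else 0.0)
    return scores

def reject_weakest(scores, fraction=0.35):
    """Zero the weights of the lowest-scoring fraction of the features."""
    n_reject = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda f: scores[f])
    weights = list(scores)
    for f in order[:n_reject]:
        weights[f] = 0.0
    return weights

def weighted_distance(t, s, weights):
    """d = sum over features f of w_f * (t_f - s_f)^2, as in the text."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, t, s))

# Toy data: feature 0 separates the classes, features 1 and 2 do not.
samples = {
    "geneA": [[1.0, 5.0, 0.0], [1.2, 1.0, 0.1]],
    "geneB": [[3.0, 4.0, 0.0], [3.2, 2.0, 0.1]],
}
scores = fisher_scores(samples)
weights = reject_weakest(scores)
d = weighted_distance(samples["geneA"][0], samples["geneB"][0], weights)
```

In the toy data only the first feature receives a large weight, so the distance between tiles of different classes is dominated by that feature.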
Phenotype similarities are determined based on how images in the test set are classified using the images in the training set. When an image from the test set is classified, each of its 16 tiles is assigned a similarity value to each of the genes in the training set. This is performed using a Weighted Nearest Neighbor rule, such that the similarity value $t_g$ of tile t to gene g is $t_g = \frac{1/d_g}{\sum_{i \in G} 1/d_i}$, where $d_g$ is the distance from tile t to its closest tile of gene g in the training set, and G is the set of all genes participating in the experiment. The similarity of an image to each of the genes is computed simply by summing the 16 similarity vectors of its 16 tiles. The sum improves the signal-to-noise ratio of the image similarity compared to the similarity vector computed from a single tile.
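A minimal sketch of this rule is given below. The function names, the dictionary-based data layout, and the small floor applied to zero distances are assumptions for illustration; the distance function is passed in as a parameter, and a plain squared Euclidean distance stands in for the weighted distance in the example.

```python
def tile_similarities(tile, training_tiles_by_gene, distance):
    """Return {gene: similarity} for one test tile.

    For each gene, d_g is the distance to the closest training tile of that
    gene; the similarity is (1/d_g) normalized over all genes.
    """
    closest = {
        gene: min(distance(tile, train_tile) for train_tile in tiles)
        for gene, tiles in training_tiles_by_gene.items()
    }
    # Assumed safeguard: avoid division by zero for identical tiles.
    inverse = {gene: 1.0 / max(d, 1e-12) for gene, d in closest.items()}
    total = sum(inverse.values())
    return {gene: value / total for gene, value in inverse.items()}

def image_similarities(image_tiles, training_tiles_by_gene, distance):
    """Sum the similarity vectors of all tiles of one test image."""
    totals = {gene: 0.0 for gene in training_tiles_by_gene}
    for tile in image_tiles:
        sims = tile_similarities(tile, training_tiles_by_gene, distance)
        for gene, value in sims.items():
            totals[gene] += value
    return totals

def squared_distance(t, s):
    """Stand-in for the Fisher-weighted distance (all weights equal to 1)."""
    return sum((a - b) ** 2 for a, b in zip(t, s))

training = {"geneA": [[0.0, 0.0]], "geneB": [[4.0, 0.0]]}
one_tile = tile_similarities([1.0, 0.0], training, squared_distance)
whole_image = image_similarities([[1.0, 0.0], [3.0, 0.0]], training, squared_distance)
```

In the example, the tile at (1, 0) is at distance 1 from geneA and 9 from geneB, so its similarity values are 0.9 and 0.1. The second tile mirrors the first, so the summed image similarities are equal for both genes.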
The resulting phenotype similarity values are computed by averaging the similarity values of all images for each of the genes participating in the experiment. That is, the similarity of the phenotype produced by knocking down gene g1 to the phenotype produced by knocking down gene g2 is the average similarity value of all images of gene g1 to gene g2. This results in a matrix of similarity values between all pairs of genes. This similarity matrix can be used for finding similar phenotypes that were produced by knocking down different genes, and those findings may in turn be used to reconstruct biological pathways.
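The averaging step can be sketched as follows, assuming each test image is represented by its true gene label and its vector of similarity values to all genes. The function name `gene_similarity_matrix` and the nested-dictionary representation of the matrix are hypothetical.

```python
def gene_similarity_matrix(test_images):
    """Average per-image similarity vectors into a gene-by-gene matrix.

    test_images is a list of (true_gene, {gene: similarity}) pairs.
    matrix[g1][g2] is the mean similarity of images of g1 to gene g2.
    """
    sums, counts = {}, {}
    for gene, sims in test_images:
        row = sums.setdefault(gene, {})
        for other, value in sims.items():
            row[other] = row.get(other, 0.0) + value
        counts[gene] = counts.get(gene, 0) + 1
    return {
        gene: {other: total / counts[gene] for other, total in row.items()}
        for gene, row in sums.items()
    }

# Made-up similarity values for three test images of two genes.
images = [
    ("g1", {"g1": 0.8, "g2": 0.2}),
    ("g1", {"g1": 0.6, "g2": 0.4}),
    ("g2", {"g1": 0.3, "g2": 0.7}),
]
matrix = gene_similarity_matrix(images)
```

Here the two images of g1 average to a similarity of 0.7 to g1 itself and 0.3 to g2.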
Manually inspecting the similarity matrix and searching for high similarity values can become an exhausting task, especially when very many genes are involved. To make this task more convenient, we visualize the phenotype similarities using phylogenies (evolutionary trees) inferred automatically by the Phylip package. The phylogenies provide a tree of phenotypes in which the lengths of the edges reflect how similar the phenotypes are, according to the values taken from the similarity matrix.
Deducing image similarities is considered a complicated task for computer programs due to the complex nature of the data, and therefore the presence of noise in the similarity values is unavoidable. Due to the noise generated by the image classifier, some of the phenotype similarities might not be symmetric. That is, the similarity of g1 to g2 may be 0.9, while the similarity of g2 to g1 is 0.86. Since the distances in the phylogeny are undirected, we simply average the two values to obtain one distance value between the two genes.
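The symmetrization can be sketched as below, together with one possible way of preparing input for a distance-based Phylip program. The conversion from similarity to distance (one minus the similarity, floored at zero) is an assumption, since the text does not specify it, and the function names are hypothetical. The output layout follows the Phylip square distance matrix convention: a first line with the number of taxa, then one line per taxon with a name padded to ten characters followed by the distances.

```python
def symmetrize(similarity):
    """Average s(a, b) and s(b, a) so the matrix becomes undirected."""
    genes = list(similarity)
    return {
        a: {b: (similarity[a][b] + similarity[b][a]) / 2.0 for b in genes}
        for a in genes
    }

def to_phylip(similarity):
    """Format a symmetric similarity matrix as a Phylip square distance matrix.

    Assumed conversion: distance = max(0, 1 - similarity).
    """
    genes = list(similarity)
    lines = [f"{len(genes):5d}"]
    for a in genes:
        distances = " ".join(
            f"{max(0.0, 1.0 - similarity[a][b]):.4f}" for b in genes
        )
        lines.append(f"{a[:10]:<10}{distances}")
    return "\n".join(lines)

# The asymmetric example from the text: 0.9 one way, 0.86 the other.
raw = {
    "g1": {"g1": 1.0, "g2": 0.9},
    "g2": {"g1": 0.86, "g2": 1.0},
}
symmetric = symmetrize(raw)
phylip_text = to_phylip(symmetric)
```

For the example values, both directions are replaced by 0.88, which corresponds to a distance of 0.12 under the assumed conversion.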
To assess the efficacy of the image analysis, we utilized a small dsRNA library (Open Biosystems) to cause single-gene knockdown in cultured Drosophila cells. The library included 14 genes, as listed in Table I, which can be divided into five expected phenotypic classes: apoptotic (dIAP1), G1 arrest (pavarotti, CyclinE, MCM2, Rad17), G1 delay (MAPk-AK2), DNA damage (p38-MPK2, FANC-M, Cul-4) and one unknown phenotype (CHD3).
Each gene had 50 experiments performed on the same slide. After fixation, cells were stained with DAPI, washed and mounted, and then deconvolved 1024×1024 images (one per experiment) were acquired using a Deltavision (Applied Precision, Inc., Issaquah, WA) microscope setup. Sample images are shown in Figure 1.
After the images were acquired, the WND-CHARM image classifier provided the phenotype similarity matrix shown in Table II. The values for each gene are normalized such that the similarity of each gene to itself is 1.
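One simple way to obtain such a normalization is to divide each row of the similarity matrix by its self-similarity value. The text does not state how the normalization was performed, so this row-wise division, the function name `normalize_by_self`, and the example values are assumptions.

```python
def normalize_by_self(matrix):
    """Divide each gene's row by its self-similarity so the diagonal is 1."""
    return {
        gene: {other: value / row[gene] for other, value in row.items()}
        for gene, row in matrix.items()
    }

# Made-up, unnormalized similarity values for two genes.
raw = {
    "g1": {"g1": 0.5, "g2": 0.25},
    "g2": {"g1": 0.1, "g2": 0.4},
}
normalized = normalize_by_self(raw)
```

Because each row is scaled by a different factor, this kind of normalization does not by itself make the matrix symmetric.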
Figure 2 shows the corresponding phylogeny, which visualizes the similarity values of Table II. As can be seen in Figure 2, the proposed method detected very similar phenotypes for genes 11 and 12. This observation is backed up by a clear link to previously reported studies, indicating that gene 11 (Dmp53) is a substrate for gene 12 (Loki).
Untreated cells (14) were not found to be similar to any of the other phenotypes, and neither were genes 2 and 9, which do not have similar phenotypes in the tested group of genes. Gene 13 (dIAP1) causes cell death, and was also found by the proposed method not to share similarities with any of the other tested genes. Gene 3 (CHD1) does not have a well-defined phenotype reported in the literature, and does not appear to be associated with any of the tested genes.
Mining for gene similarities has been attracting considerable attention in the field of bioinformatics. Due to difficulties in processing and comparing large sets of different phenotypes, most attempts at finding genes with similar functionality are based on sequence analysis methods (e.g., BLAST). These methods rely heavily on the contention that genes with similar functionality should also have similar sequences, generating similar proteins. However, in many cases genes with different sequences can be part of the same biological mechanisms.
Here we described an automated process for mining phenotype similarities. This is a new approach to finding genes with similar functionality, and is different from finding gene similarity by comparing sequences.
Clearly, the proposed method can only sense phenotypic features that are visible using a microscope, and due to the complex nature of quantifying phenotype morphology it is not expected to detect all genes with similar functionality. However, given that very many of the genes in any organism are not mapped to any known function, applying this method to large sets of phenotypes with single-gene knockdown can potentially reconstruct biological pathways.
This research was supported entirely by the Intramural Research Program of the NIH, National Institute on Aging.