Metazoan cells can adopt an extraordinarily diverse spectrum of shapes. For example, the cuboidal, polarized morphology of epithelial cells differs markedly from that of neuronal cells, which extend long, thin, and highly branched projections. The shape of an individual cell is the result of a complex interplay between the activity of thousands of genes and the cell's environment. Understanding this interplay is a fundamental challenge in developmental and cell biology. Currently, there are two key aspects to deciphering cellular morphogenesis on a genome scale. The first is to determine the individual functional contribution of every gene to the regulation of cell shape, and the second is to describe how complex relationships between cell shape genes affect morphology. With the advent of high-throughput RNA interference (RNAi) screening technologies, particularly in model systems such as Drosophila melanogaster, it is now possible to systematically query the involvement of genes in the regulation of different cellular processes and functions. Typically, RNAi-based genetic screens involve the acquisition of relatively low-content, single-dimensional data that can be analyzed using conventional, unbiased means, making such screens feasible on genome or even multi-genome scales [1]. In order to facilitate similar analysis of image-based screens, we and other researchers have recently developed novel image segmentation algorithms to rapidly quantitate hundreds of different parameters at the single-cell level in an automated fashion [3], and we have demonstrated that such image segmentation algorithms can be used in the context of genetic screens [7]. Notably, however, this and other similar screens [8] have been 50–100-fold smaller in scale than typical low-dimensional screens and are not yet genome-scale. The reduced scale of these screens is due in large part to the fact that the expert opinion of cell biologists remains an essential and rate-limiting step in the analysis of many image-based datasets. Although human intervention is not required in screens where the potential phenotypic outcomes are few in number or binary (e.g. an image-based screen in which a marker is scored as nuclear or non-nuclear), such intervention is currently necessary to identify novel or subtle phenotypes in image-based datasets of genetic or chemical perturbations, where the dynamic range of cellular phenotypes cannot be predicted before the data are collected. For example, in genome-scale screens for regulators of cell shape, it is impossible to predict a priori the diversity of morphologies that will ultimately be present in the dataset. Failure to accurately measure this phenotypic variation will lead to classification errors, especially false negatives, and misleading results. Current methodologies usually employ a two-step procedure to maximize the amount of variation captured in a particular image-based analysis. First, 100–600 phenotypic features are measured at the single-cell level (automatically, and as exhaustively as possible), and second, supervised techniques assisted by biologists are used both to reduce the dimensionality of the feature space and to carry out classification of the images. At a minimum, the biologist must perform preliminary qualitative visual scoring of a small part of the dataset in order to gain a crude assessment of the phenotypic variance present in this subset. Unfortunately, such analysis is impossible in screens where millions of images are acquired, which greatly limits the ability of these screens to identify new phenotypes. The issues of defining meaningful phenotypes and of describing them using informative feature subsets are closely related. Automated feature-space reduction schemes have been implemented in the context of high-content screens, including the feature extraction methods examined in [9], factor analysis in [10], and the SVM-RFE method in [11]. These methods allow more effective modelling of existing phenotypes, and they also highlight the need to update informative feature sets so that they can not only model existing phenotypes but also discover novel ones.
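As a hedged illustration of the SVM-RFE style of feature reduction mentioned above, the sketch below uses scikit-learn's `RFE` wrapper around a linear SVM; the toy data, dimensions, and parameters are illustrative and are not taken from any of the cited studies.

```python
# Illustrative SVM-RFE sketch: recursively drop the features with the smallest
# linear-SVM weights until only the requested number remain.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data: 200 "cells" x 20 features; only features 0 and 1 determine the label.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
informative = np.flatnonzero(selector.support_)
print(informative)  # should tend to recover the two informative features
```

In a real screen the labels would come from the biologist-scored subset, and the retained feature indices would define the reduced feature space used for subsequent classification.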
Cluster analysis is widely used to reveal the structure of unlabeled datasets. Specifically, a number of methods have been developed to estimate the number of clusters in a dataset, for example using a series of internal indices [12], jump methods [13], and weighted gap statistics [14]. Moreover, supervised approaches to cluster validation, such as a re-sampling strategy [15], prediction strength [16], methods based on mixture models and inference of Bayes factors [17], or application-specific strategies [19], have also been implemented. Nevertheless, most existing methods rest on particular hypotheses about a fixed dataset and cannot be used directly for online phenotype discovery, where new images continuously extend the dataset and millions of cells are involved. Improper assumptions about the data structure may cause incorrect division or merging of biologically meaningful phenotypes. To avoid this problem, such methods combine each new image with the entire existing dataset (regardless of the large difference in cell numbers) and must frequently be re-run from the very beginning.
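To make the cluster-number estimation concrete, the following numpy-only sketch implements the basic gap statistic (Gap(k) = E[log W_k] under a uniform reference minus log W_k on the data, choosing the smallest k with Gap(k) ≥ Gap(k+1) − s_{k+1}). The function names and the toy data are illustrative, not part of the screen's pipeline.

```python
# Minimal gap-statistic sketch for estimating the number of clusters k.
import numpy as np

def kmeans_logW(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns log of the within-cluster dispersion W_k."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return np.log(W)

def estimate_k(X, k_max=5, n_ref=10, seed=0):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; reference data are drawn
    uniformly over the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref = np.array([kmeans_logW(rng.uniform(lo, hi, X.shape), k, rng)
                        for _ in range(n_ref)])
        gaps.append(ref.mean() - kmeans_logW(X, k, rng))
        s.append(ref.std() * np.sqrt(1 + 1 / n_ref))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return k_max

# Two well-separated groups of "cells": the statistic should settle on k = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(4, 0.3, (60, 2))])
print(estimate_k(X))
```

The fixed-dataset assumption is visible here: every call re-clusters the full dataset for each candidate k, which is exactly the cost that becomes prohibitive when new images arrive continuously.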
Methods for online phenotype discovery should be sensitive and flexible to various phenotypes while avoiding frequent re-modelling of the complete existing dataset. As a kernel-machine-based novelty detection method, the one-class SVM has been used for "off-line" phenotype discovery [20]. However, two major points limit its application to high-throughput image-based screens, especially screens for cell shape regulators. First, a one-class SVM assigns every test sample to one of only two classes, "novel" and "known", whereas many high-throughput RNAi datasets may contain multiple diverse and unique novel phenotypes that should not necessarily be grouped together. Subsequent cluster analysis would therefore be needed to identify and model the different novel phenotypes after applying the one-class SVM. Second, each time a novel phenotype is discovered using a one-class SVM, the support vectors must be modified so that the newly discovered phenotype is treated as "known" in subsequent loops; otherwise it will continue to be identified as novel. As mentioned earlier, in a typical RNAi screen covering 1,000s–10,000s of genes, with dozens of images per RNAi and hundreds of cells per image, such updating would involve millions of cells and is intractable.
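The first limitation can be seen directly with scikit-learn's `OneClassSVM` (assuming scikit-learn is available; the data and parameters below are illustrative): two clearly distinct novel populations both collapse into the single "novel" label.

```python
# Sketch of the binary "novel"/"known" output of a one-class SVM.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
known = rng.normal(0, 0.5, (200, 2))          # cells of one known phenotype
novel_a = rng.normal((5, 0), 0.5, (30, 2))    # two distinct novel phenotypes
novel_b = rng.normal((0, 5), 0.5, (30, 2))

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(known)
pred = clf.predict(np.vstack([novel_a, novel_b]))

# Both novel groups receive the same label (-1); telling them apart would
# require an additional clustering step on the rejected cells.
print(np.unique(pred))
```

This is why a subsequent clustering step, and costly support-vector updates after every discovery, would be unavoidable in an online setting.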
Here we describe the development of an online phenotype discovery pipeline that we implemented in the context of a high-throughput image-based RNAi screen for regulators of cell shape. A simplified scheme of online phenotype discovery is shown in Figure . Online phenotype discovery requires adaptively identifying various novel phenotypes on the basis of multiple existing phenotypes (e.g. those identified a priori by biologists), remaining sensitive and flexible to various new phenotypes, and avoiding frequent re-modelling of the large existing dataset. Our method includes two key components: phenotype modelling and iterative cluster merging. First, a Gaussian mixture model (GMM) is estimated for each existing phenotype following [21]. Second, iterative cluster merging is performed based on gap statistics. When a new image is incorporated, we sample the GMM of each existing phenotype and start a series of merging loops. In each loop, the image is combined with the sample set for one existing phenotype; we estimate the number of clusters in the combined dataset using gap statistics, using the GMM of the existing phenotype as part of the reference distribution. If some cells in the new image cluster together with samples from the existing phenotype, they are merged into that phenotype, i.e. they are added to the dataset of the existing phenotype and removed from the new image. The iterations continue until the sample set from each existing phenotype has been combined with the new image and merged with its counterpart (if any exists). Upon completion of all loops, the remaining cell groups in the new image are identified as candidates for new phenotypes. By sampling reference datasets from the new image and the existing phenotypes separately, utilizing the GMMs of existing phenotypes as (part of) the reference distribution, and involving existing clusters one by one, our method improves on the ideas in [12] and becomes more effective. Experimental results show that the proposed method is robust and efficient for online phenotype modelling and discovery in the context of diverse image-based screens, especially RNAi screens in Drosophila.
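A single merging loop of this scheme can be sketched in simplified form. The code below is a stand-in, not the actual pipeline: a single Gaussian replaces the per-phenotype GMM, plain 2-means replaces gap-statistic-guided clustering, and all names and data are illustrative.

```python
# Simplified sketch of one merging loop of the online discovery scheme.
import numpy as np

rng = np.random.default_rng(2)

def merge_with_phenotype(new_cells, pheno_mean, pheno_cov, n_samples=100, n_iter=50):
    """Return a boolean mask marking new cells merged into the existing phenotype."""
    # Sample the (here single-Gaussian) model of the existing phenotype.
    samples = rng.multivariate_normal(pheno_mean, pheno_cov, n_samples)
    X = np.vstack([samples, new_cells])
    # 2-means on the combined set (stand-in for gap-statistic-guided clustering).
    centers = X[rng.choice(len(X), 2, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in (0, 1)])
    # New cells falling into the cluster that holds the phenotype samples merge.
    pheno_label = np.bincount(labels[:n_samples]).argmax()
    return labels[n_samples:] == pheno_label

# New image: 40 cells resembling the existing phenotype plus 40 candidate-novel cells.
new_image = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(6, 0.4, (40, 2))])
merged = merge_with_phenotype(new_image, np.zeros(2), 0.16 * np.eye(2))
remaining = new_image[~merged]   # candidates for a novel phenotype
print(merged.sum(), len(remaining))
```

Because each loop compares the new image against a modest sample from one phenotype model rather than the full accumulated dataset, the per-image cost stays bounded as the screen grows.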
Tasks and simple scheme of online phenotype discovery.