Examining cells by microscopy has long been a primary method for studying cellular function. When cells are stained appropriately, visual analysis can reveal biological mechanisms. Advanced microscopes can now, in a single day, easily collect thousands of high resolution images of cells from time-lapse experiments and from large-scale screens using chemical compounds, RNA interference (RNAi) reagents, or expression plasmids [1]. However, a bottleneck exists at the image analysis stage. Several pioneering large screens have been scored through visual inspection by expert biologists [6], whose interpretive ability will not soon be replicated by a computer. Still, for most applications, image cytometry (automated cell image analysis) is strongly preferable to analysis by eye. In fact, in some cases image cytometry is absolutely required to extract the full spectrum of information present in biological images, for reasons we discuss here.
First, while human observers typically score one or at most a few cellular features, image cytometry simultaneously yields many informative measures of cells, including the intensity and localization of each fluorescently labeled cellular component (for example, DNA or protein) within each subcellular compartment, as well as the number, size, and shape of those subcellular compartments. Image-based analysis is thus versatile, inherently multiplexed, and high in information content. Like flow cytometry, image cytometry measures the per-cell amount of protein and DNA, but can more conveniently handle hundreds of thousands of distinct samples and is also compatible with adherent cell types, time-lapse samples, and intact tissues. In addition, image cytometry can accurately measure protein texture and localization as well as cell shape and size.
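As an illustration of what per-cell measurement involves, the following is a minimal sketch, not the software's actual implementation. It assumes that cells have already been identified and stored as a label mask (0 for background, a positive integer per cell) alongside an intensity image of the same size. The function name `measure_objects` and the toy data are hypothetical.

```python
from collections import defaultdict


def measure_objects(labels, intensity):
    """Return per-object area, integrated intensity, and mean intensity.

    `labels` and `intensity` are equally sized 2D lists; label 0 is background.
    """
    area = defaultdict(int)
    total = defaultdict(float)
    for label_row, intensity_row in zip(labels, intensity):
        for label, value in zip(label_row, intensity_row):
            if label == 0:  # skip background pixels
                continue
            area[label] += 1
            total[label] += value
    return {
        label: {
            "area": area[label],
            "integrated_intensity": total[label],
            "mean_intensity": total[label] / area[label],
        }
        for label in sorted(area)
    }


# Toy data (assumed): two cells in a 3 x 5 image of a DNA stain.
labels = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 2, 2],
]
dna = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
]
measurements = measure_objects(labels, dna)
for label, features in measurements.items():
    print(label, features)
```

Each stained channel and each subcellular compartment would add further columns to such a table, which is what makes the approach inherently multiplexed.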
Second, human-scored image analysis is qualitative, usually categorizing samples as 'hits' (where normal physiology is grossly disturbed) or 'non-hits'. By contrast, automated analysis rapidly produces consistent, quantitative measures for every image. These measures not only uncover subtle samples of interest that would otherwise be missed but also allow systems-level conclusions to be drawn directly. Measuring a large number of features, even features undetectable by eye, has proven useful for screening as well as cytological/cytometric profiling, which can group similar genes or reveal a drug's mechanism of action [3].
Third, image cytometry individually measures each cell rather than producing a score for the entire image. Because individual cells' responses are inhomogeneous [15], multiparametric single cell data from several types of instruments have proven much more powerful than whole-population data (for example, western blots or mRNA expression chips) for clustering genes, deriving causal networks, classifying protein localization, and diagnosing disease [10]. In addition, individual cell measurements can reveal samples that differ in only a subpopulation of cells, which would otherwise be masked in whole-population measures.
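The masking effect can be seen in a small numerical sketch. The values below are invented for illustration: a treated sample in which 5% of cells respond with a three-fold increase in some per-cell measure, compared with a uniform control. The helper `fraction_above` and the threshold of 2.0 are assumptions, not part of any published method.

```python
from statistics import mean


def fraction_above(values, threshold):
    """Fraction of cells whose measurement exceeds `threshold`."""
    return sum(v > threshold for v in values) / len(values)


# Invented per-cell measurements (arbitrary units).
control = [1.0] * 100
treated = [1.0] * 95 + [3.0] * 5  # 5% responder subpopulation

# Whole-population view: the means differ by only about 10%.
mean_control = mean(control)
mean_treated = mean(treated)

# Single-cell view: the responders are counted directly.
responders_control = fraction_above(control, 2.0)
responders_treated = fraction_above(treated, 2.0)

print(f"means: {mean_control:.2f} vs {mean_treated:.2f}")
print(f"responders: {responders_control:.0%} vs {responders_treated:.0%}")
```

A shift of this size in the population mean could plausibly be lost in sample-to-sample variation, whereas the per-cell data show a distinct group of responding cells.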
Fourth, quantitative image analysis is able to detect some features that are not readily detectable by a human observer. For example, the two-fold difference in DNA staining intensity that reveals whether a cell is in G1 or G2 phase of the cell cycle is measurable by computer but is difficult for the human eye to observe in cell images. Furthermore, small but biologically significant differences, for example, a 10% increase in nucleus size, are not noticeable by eye. Other features, for example, the texture (smoothness) of protein or DNA staining, are observable but not quantifiable by eye. Pathologists have known for years that changes in DNA or protein texture can correlate with profound and otherwise undetectable changes in cell physiology, a fact used in diagnosis of disease [17]. Even changes not visible to the human eye can reveal disease state [20].
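To make the G1/G2 example concrete, here is a deliberately simplified sketch. It assumes that an integrated DNA intensity is available for each nucleus and that a reference intensity for G1 cells is known; both the function `classify_dna_content` and the cutoff of 1.5 (the midpoint between one-fold and two-fold DNA content) are hypothetical choices. S-phase cells and the distinction between G2 and mitosis are ignored.

```python
def classify_dna_content(integrated_dna, g1_reference, cutoff=1.5):
    """Assign 'G1' or 'G2' to each cell from its integrated DNA intensity.

    A cell is called 'G2' when its intensity is at least `cutoff` times the
    G1 reference. The cutoff is an illustrative assumption.
    """
    return [
        "G2" if value / g1_reference >= cutoff else "G1"
        for value in integrated_dna
    ]


# Invented integrated DNA intensities for five nuclei (arbitrary units).
cells = [98.0, 103.0, 201.0, 95.0, 196.0]
phases = classify_dna_content(cells, g1_reference=100.0)
print(phases)
```

In practice the reference intensity would more likely be estimated from the histogram of all cells in a sample than supplied by hand.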
Fifth, image cytometry is much less labor-intensive and has higher throughput than visual inspection. Appropriate software produces reliable results from a large-scale experiment in hours, versus months of tedious visual inspection. This improvement is more than an incremental technical advance, because it relieves the one remaining bottleneck to routinely conducting such experiments.
Prior to the work presented here, the only flexible, open-source biological image analysis package was ImageJ/NIH Image [21]. This package has been successfully used by many laboratories. Its design, however, is geared more towards the analysis of individual images (comparable to Adobe Photoshop) than towards flexible, high-throughput work. Macros can be written in ImageJ for high-throughput work, but adapting macros to new projects requires that biologists learn a programming language.
While not creating a general, flexible software tool, many groups have benefited from automated cell image analysis by developing their own scripts, macros, and plug-ins to accomplish specific image analysis tasks. Custom programs written in commercial software (for example, MetaMorph, ImagePro Plus, MATLAB) or Java have been used to identify, measure, and track cells in images and time-lapse movies [10]. Such studies clearly show the power of automated image analysis for biological discovery. However, most of these custom programs are not modular, so combining several steps and changing settings requires interacting directly with the code and is simply not practical for routinely processing hundreds of thousands of images or sending jobs to a cluster. The effort expended by laboratories in creating an analysis solution with a particular software package is often lost after the initial experiment is completed; other laboratories rarely use the methods because they are customized for a particular cell type, assay or even image set. Furthermore, although developing a routine for a new cell type or assay usually requires testing multiple algorithms, it is impractical to implement and test several published methods for a particular project.
Commercial software has also been developed, mainly for the pharmaceutical screening market, by companies including Cellomics, TTP LabTech, Evotec, Molecular Devices, and GE Healthcare [24]. Development of these packages has been guided mainly by mammalian cell types and cellular features of pharmaceutical interest, including protein translocation, micronucleus formation, neurite outgrowth, and cell count [25]. The high cost and the bundling of commercial software with hardware make it impractical to test several programs for a new project. The proprietary nature of the code prevents researchers from knowing the strategy of a given algorithm or modifying it if desired. As is the case with many laboratories, we have found commercial packages useful for some screens in mammalian cells, but in other cases limiting [1].
Furthermore, key challenges remain in image analysis algorithm development itself [28]. Cell image analysis has been described as one of the greatest remaining challenges in screening [5], and as a field is "very much in its infancy" [30] and "lag[s] behind the adoption of high-throughput imaging technologies" [10]. Accurate cell identification is required to extract meaningful measures from images, but even for mammalian cell types, existing software often fails on crowded cell samples, which has severely limited screens thus far. Screens in most non-mammalian organisms have been limited to visual inspection.
In summary, while existing software enables particular assays for particular cell types, high-throughput image analysis has, to this point, been impractical unless an image analysis expert develops a customized solution, or unless commercial packages are used with their built-in algorithms for a limited set of cellular features and for a limited set of cell types. There exists a clear need for a powerful, flexible, open-source platform for high-throughput cell image analysis.
Here we describe the open-source CellProfiler project, our effort to develop such a software system for the scientific community. CellProfiler simultaneously measures the size, shape, intensity and texture of a variety of cell types in a high-throughput manner. Note that we focus in this paper not on the technical details of the software (which are described in the manual), nor on computational validation of the mostly published algorithms, nor on a mechanistic study of any particular biological finding. Rather, we describe the system, validate the software for a variety of real-world biological problems, demonstrate the breadth of its utility (including on various cell types and assays), and hope to stimulate ideas within the biological community for future applications of the software.