Any model of cellular structures should give a description of the statistical relationship between the variables of interest, i.e. a mathematical description of the probability of any combination of values assigned to these variables. These will allow evaluation of the likely behavior of one set of variables given conditions on another. For example, the microtubule catastrophe rate might be more likely to be lower in mitosis compared to interphase, with exceptions being due to cell-to-cell variability. The model might just include the mean rates for mitosis and interphase, or it could additionally contain the standard deviation of each rate. The latter model states that the catastrophe rate of interphase cells is normally distributed with that mean and standard deviation (and similarly for mitotic cells). Note that models may not explicitly state their statistical assumptions but still have them. For example, the common differential equation-based model representing protein interactions specifies that the rate at which a particular protein’s concentration changes is solely a function of its and other proteins’ concentrations and is unaffected by random noise [23
Models have some number of parameters, e.g. the mean catastrophe rates in interphase and mitosis, that allow them to represent a range of behaviors or patterns. A specific behavior is selected with a corresponding set of values for those parameters, e.g. lower rates in mitosis. These parameters can be chosen automatically by statistically estimating them from collected measurements (called training data) of the system of interest. Data are processed into a consistent, comparable and measurable form, usually vectors of numeric values of a specified length (called feature vectors), e.g. multiple measurements of the length of a microtubule over time (from which catastrophe rate could be inferred).
There are two major categories of statistical models, discriminative
] and generative
], and both have been applied to the study of subcellular organization and protein patterns. Discriminative
models only represent the probability of a feature vector being from each particular pattern and explicitly do not consider the physical or biological mechanism by which the measurements (images) were generated. We can only ask of a discriminative model: how likely is it that this feature vector comes from a particular pattern?
Generative models, on the other hand, also represent the probability of observing a particular feature vector when it comes from a particular pattern, and they even commonly contain variables that are not measured (latent variables). An example of a latent variable would be microtubule catastrophe rate in the case that the only variables measured were the lengths of the microtubules in the cell. Thus, a different query can be made of a generative model: what are examples of images I would expect for a particular protein in a given cell type under a specific condition?
As an illustrative example of the advantages of using a generative model, suppose that we wish to create a simulation in which the distributions of many proteins are represented in a single cell. We could measure this by imaging all of them simultaneously. However, technology for imaging more than a few proteins in the same sample is not available, especially for live cells. Even if we wish to build up a model from measurements of colocalization of subsets of the proteins, there are too many combinations to feasibly image (for 1,000 proteins, there are 166 million combinations of three proteins). However, generative models can approximate the colocalization of these proteins. We can build a generative model of the pattern of each protein individually that depends only on the cellular and nuclear shapes of the cell. We can then hypothesize that proteins with similar model parameters are colocalized, and can create synthetic cell images in which these proteins are placed in the same structures. This gives us an image of the same cell showing the locations of many proteins. Extending this model to include dependency on structures other than the nuclear and cell membranes, such as the cytoskeleton, would give an even more accurate synthetic cell.
Generative models can also be used to find probable values for latent variables. While a discriminative model may easily distinguish between two images of microtubules from different conditions (say wild type and treated with nocodazole, which depolymerizes microtubules), a generative model can include latent variables parameterizing the process of microtubule growth, e.g. number and average length of microtubules. Thus, unlike with discriminative models, learning the parameters of the generative model could encode the basis for the differences between patterns.
As previously proposed [11
], models of cellular structures ideally should be:
- automated: learnable automatically from images;
- generative: able to synthesize new, simulated images displaying the specific pattern(s) learned from images;
- statistically accurate: able to capture pattern variation between cells;
- compact: representable by a small number of parameters and communicable with significantly fewer bits than the training images.
In the context of this paper, one key issue in building models of cells is that they need to contain modular pieces which depend on each other in order to capture correlations between structures of the cell when synthesizing an image. For example, endosomes lie between the nuclear and cell membranes, so to synthesize an image displaying an endosomal protein’s distribution pattern, one might generate a nuclear shape, then, given that the cell membrane is always outside the nuclear membrane and their orientations are correlated [11
], generate a cell shape whose probability distribution depends on the selected nuclear shape, and finally generate endosome-shaped objects so that they lie between the two shapes. Combined together, these pieces produce a generated image which is analogous to a real multichannel microscope image. Open source software components that can learn such conditional models and synthesize instances as images are available at http://CellOrganizer.org
One important issue in modeling is the balance between accuracy and precision and how it is affected by model complexity, i.e. the number of parameters. We want to choose a level of complexity that will allow the model to generate images resembling real ones while maintaining computational feasibility. For example, a nuclear shape approximated by an ellipse could not incorporate bends or blebs, but a model using polygons with thousands of vertices might take hundreds of thousands of images to have its parameters properly estimated.