Given methods for learning the fundamental patterns and estimating how much of a given protein is associated with each of them, we now turn to the question of how to communicate the nature of each of those patterns. At a conceptual level, the most complete model of subcellular organization is probably the GO Cellular Component ontology. We can imagine easily associating GO terms with most (but perhaps not all) fundamental patterns by checking which organelle markers are assigned to each. However, such a conceptual model does not provide a spatially accurate representation of each fundamental pattern (including how that pattern varies from cell to cell).

To be useful for spatially realistic modeling, ontology terms must be associated with a representation of properties such as the number of objects in a component and their structure and distribution within cells. Currently, such representations are abstract and implicit rather than concrete, and they often leave unspecified how the organelle would look in different cell types. For example, the abstract concept of a mitochondrion is well understood by biologists, but most would be hard pressed to describe how mitochondria vary in number, size, shape, and distribution from cell type to cell type or organism to organism.

Of course, one approach is simply to represent each pattern with an example image containing it. This can be extended by representing a pattern by all (or a subsampling) of its images. This leaves open the question of how to integrate the information into other systems, especially when it is desirable to know how large numbers of patterns would look in the same cell. We have proposed that the learning of generative models of each pattern is a solution to this problem (9) (see , lower path). In this context, we define a model as generative if it can produce synthetic images that are by some specified criteria statistically indistinguishable from real images used to train it.

A key issue in building generative models of cells is that they need to contain pieces that depend on each other. For example, synthesizing an image showing the distribution of lysosomes is dependent on having a cell boundary within which to place them, and the position of the cell boundary and the nuclear boundary must be dependent on each other so that the nucleus is inside the cell. We have chosen to address the latter issue by starting with a model of nuclear shape and making the cell shape model dependent (conditional) on it, but the opposite approach is also feasible.

Nuclear models

Another important issue in building models is choosing an appropriate level of complexity with which to represent instances (examples) of the model. For example, we can consider everything from modeling all nuclei as ellipsoids (10, 11) to making a detailed tracing or mesh representation of the surface of each nucleus. For building models of 2D subcellular patterns, we considered a compromise in which eleven parameters were used to describe a medial axis representation of each nucleus. This process is illustrated in . The advantage is that the model is compact but still captures much of the variation in length, width, and curvature. The disadvantage is that it cannot represent forked shapes, which were not observed in the images of unperturbed HeLa cells used in our initial work, but can be observed under other conditions.

To address this, we have developed an alternative, diffeomorphic approach to describing and modeling nuclear shape (12–14). A related approach was first described by Yang et al. (15) for the purpose of registering nuclear images. The principle is that the variation in shape among a population of nuclei can be represented by a measure of the pairwise differences in their shapes (12). This measure is found by determining how much work must be done to morph one of them into the other. The result is a square, symmetric matrix of dimensions corresponding to the number of nuclei. Using multidimensional scaling, this matrix can be converted such that each nucleus is represented by a vector in some Euclidean space. The higher the dimension of that space, the closer the approach comes to perfectly capturing the original distance matrix (it is perfect to within numerical accuracy when the dimension equals the number of nuclei). shows reducing the dimension to just two. Remarkably, variation along the first dimension corresponds to nuclear elongation, and variation along the second dimension corresponds to bending. This gives a very compact representation of the shape variation in the nuclear population (12, 13).
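
The conversion from a pairwise distance matrix to Euclidean coordinates can be illustrated with classical multidimensional scaling. The following is a minimal sketch rather than the implementation used in (12–14): the function name `classical_mds` is hypothetical, and the toy distance matrix stands in for real shape distances, which would require the morphing computation described above.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed items in n_dims Euclidean dimensions from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the n_dims largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip small negative eigenvalues, which arise when distances are not Euclidean.
    scales = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scales

# Toy data (assumption): distances between random 2D points stand in for
# pairwise shape distances between nuclei.
rng = np.random.default_rng(0)
toy_points = rng.normal(size=(6, 2))
toy_dist = np.linalg.norm(toy_points[:, None, :] - toy_points[None, :, :], axis=-1)
coords = classical_mds(toy_dist, n_dims=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy distances come from points that really are two-dimensional, two dimensions reproduce the distance matrix to within numerical accuracy. Real shape distances would generally need more dimensions for an exact fit.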

However, this approach does not directly give a means of generating new shapes. This limitation was overcome by recursively interpolating shapes at points in the shape space chosen according to a probability density function estimated from the original nuclei (14). This permits the diffeomorphic approach to be used in a generative model, but requires that the original nuclear shapes be saved along with the reduced shape space coordinates of each and the probability density function. The amount of storage required can be reduced by saving a smaller number of examples (e.g., just examples at peaks in the probability density function).
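
The first step of this procedure, choosing new points in the shape space according to an estimated density, can be sketched as follows. This assumes a Gaussian kernel density estimate, which is one possible choice and not necessarily the estimator used in (14); the function name and the `bandwidth` parameter are hypothetical. The shape interpolation itself is not shown.

```python
import numpy as np

def sample_shape_space(coords, n_samples, bandwidth, rng):
    """Draw points from a Gaussian kernel density estimate over shape-space coordinates."""
    coords = np.asarray(coords, dtype=float)
    # Sampling a Gaussian KDE: pick a training point at random,
    # then add isotropic Gaussian noise with the kernel bandwidth.
    picks = rng.integers(coords.shape[0], size=n_samples)
    noise = rng.normal(scale=bandwidth, size=(n_samples, coords.shape[1]))
    return coords[picks] + noise

# Toy shape-space coordinates (assumption): 40 nuclei embedded in two dimensions.
rng = np.random.default_rng(0)
training_coords = rng.normal(size=(40, 2))
new_points = sample_shape_space(training_coords, n_samples=5, bandwidth=0.2, rng=rng)
```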

We can now consider a single generative framework for storing models of nuclei and other cell components in which the first “slot” of the framework specifies which type of nuclear model to use as well as the parameters for that model. In the case of the diffeomorphic model, the parameters are very extensive. Other types of generative models of nuclear shape can be used (16, 17), although our overall philosophy is to prefer models whose parameters are automatically learned from images.

Cell shape models

The next “slot” of the generative framework is filled by a cell shape model. While approaches that model cell shape alone have been described (18–20), we focus on building a cell shape model that is learned directly from images and is conditional on the nuclear shape. This ensures that the two shapes are compatible with each other (e.g., that the nucleus is inside the cell) and that any relative orientation of the two is captured. For this we use a simple approach in which a cell to be modeled is first rotated so that its major axis points in a defined direction and flipped (if necessary) so that the side (relative to the major axis) with the larger area is also matched. Then, at each angle from 0° to 360° relative to the major axis, we measure the ratio between the distance from the center of the nucleus to the nearest point on the plasma membrane in that direction and the distance from the center to the nearest point on the nuclear membrane in that direction. This set of relative coordinates is reduced to a small number (10) of principal components. New cell shapes can then be synthesized (after synthesizing a nuclear shape) by randomly choosing values for the principal components and using the synthesized ratios to mark out the cell boundary. Conditional, diffeomorphic models of cell shape can also be made.
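
The ratio-based model can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, principal component scores are assumed to be independent Gaussians, and ratios below 1 are clipped so that the synthesized cell boundary never falls inside the nucleus.

```python
import numpy as np

def fit_ratio_pca(ratios, n_components=10):
    """Fit PCA to an (n_cells, n_angles) array of cell-to-nucleus radius ratios."""
    ratios = np.asarray(ratios, dtype=float)
    mean = ratios.mean(axis=0)
    # SVD of the centered data gives the principal axes as the rows of vt.
    _, sing, vt = np.linalg.svd(ratios - mean, full_matrices=False)
    k = min(n_components, vt.shape[0])
    # Standard deviation of the training cells along each principal axis.
    stds = sing[:k] / np.sqrt(max(ratios.shape[0] - 1, 1))
    return mean, vt[:k], stds

def sample_cell_boundary(nuclear_radii, mean, axes, stds, rng):
    """Draw one ratio profile and scale the nuclear radii to obtain cell radii."""
    # Assumption: component scores are independent Gaussians.
    scores = rng.normal(0.0, stds)
    ratios = mean + scores @ axes
    # Assumption: clip so the cell boundary lies on or outside the nucleus.
    ratios = np.maximum(ratios, 1.0)
    return ratios * nuclear_radii

# Toy training data (assumption): 50 cells sampled at 360 angles.
rng = np.random.default_rng(0)
angles = np.deg2rad(np.arange(360))
training_ratios = (2.0 + 0.3 * rng.normal(size=(50, 1)) * np.cos(angles)
                   + 0.05 * rng.normal(size=(50, 360)))
mean, axes, stds = fit_ratio_pca(training_ratios)
nuclear_radii = np.full(360, 5.0)  # a circular synthetic nucleus
cell_radii = sample_cell_boundary(nuclear_radii, mean, axes, stds, rng)
```

Multiplying the sampled ratios by the nuclear radii at the same angles is what makes the cell shape conditional on the nuclear shape.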

Models of subcellular components

We now turn to the most difficult part of building cell models, representing subcellular components. Much work remains to be done in this area. Two distinct but preliminary approaches for representing a subset of protein patterns are described here.

Object-based models: Direct learning

The first approach is to build object-based models (6). This approach is most suited to organelles such as endosomes, lysosomes, and peroxisomes that largely exist as discrete vesicles. As a first approximation, these can be modeled as Gaussian objects, that is, as circles (or spheres in 3D) whose intensity decreases with distance from their centers (as expected if the intensity in a given pixel is proportional to the volume of the object that underlies that pixel). Since cell images often have two or more vesicles touching or overlapping, we use non-linear fitting to estimate the number and sizes of the vesicles that are most likely to have given rise to a particular image. Doing this for many cells allows distributions to be learned for the number of objects per cell and their size variation in each cell. The position of each object relative to the nearest point on the nucleus and the nearest point on the cell membrane is then calculated and used to create a 2D (or 3D) position probability density function. The synthesis of new patterns is then quite simple. For each cell, a number of objects is drawn from the number distribution, and for each object a size and a position are sampled from the size distribution and the position probability density function, respectively. These are used to place the objects into the nuclear and cell shape models described above. An example of a synthesized image showing a lysosomal pattern is shown in .
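
The synthesis step can be sketched as below. This is a simplified illustration, not the model in (6): the function name is hypothetical, the number of objects is assumed to be Poisson distributed, sizes are assumed to be normally distributed, and positions are drawn uniformly from the cytoplasm in place of a learned position probability density function.

```python
import numpy as np

def synthesize_vesicle_image(cell_mask, nucleus_mask, count_mean, size_mean, size_sd, rng):
    """Place Gaussian objects at random allowed positions and return the image."""
    height, width = cell_mask.shape
    rows, cols = np.mgrid[0:height, 0:width]
    # Allowed object centers: inside the cell but outside the nucleus.
    allowed = np.argwhere(cell_mask & ~nucleus_mask)
    # Assumption: Poisson-distributed number of objects per cell.
    n_objects = rng.poisson(count_mean)
    image = np.zeros((height, width))
    for _ in range(n_objects):
        # Assumption: uniform positions; a learned density would replace this draw.
        r0, c0 = allowed[rng.integers(len(allowed))]
        # Assumption: normally distributed sizes, with a floor to keep sigma positive.
        sigma = max(rng.normal(size_mean, size_sd), 0.5)
        image += np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / (2.0 * sigma ** 2))
    # Zero out any intensity that spills outside the cell boundary.
    return image * cell_mask, n_objects

# Toy shapes (assumption): concentric circular cell and nucleus masks.
rng = np.random.default_rng(1)
yy, xx = np.mgrid[0:64, 0:64]
cell_mask = (yy - 32) ** 2 + (xx - 32) ** 2 <= 28 ** 2
nucleus_mask = (yy - 32) ** 2 + (xx - 32) ** 2 <= 10 ** 2
image, n_objects = synthesize_vesicle_image(cell_mask, nucleus_mask, 20, 1.5, 0.3, rng)
```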

Network models: Inverse modeling

The second approach is designed for network distributions, such as the tubulin cytoskeleton, that are not appropriately modeled as objects. Since elements of such networks frequently cross and pile up near the center of the cell, it is difficult to estimate parameters of a model from conventional microscope images. One solution is to use specialized microscopic methods, such as speckle microscopy, that image only a portion of the network at a time (21). Excellent models of actin polymerization in the leading edge of a crawling cell have been obtained by this approach (22). However, speckle microscopy requires suitable polymerization and depolymerization rates and may not be appropriate for all network proteins. An alternative for extracting model parameters from wide-field microscope images is to use inverse modeling. The principle is that the parameters that describe the state of a network in a real image can be estimated using a model that can synthesize images for many parameter values and a comparator that finds the synthetic image whose appearance is closest to the real one. One of the earliest uses of this approach was to estimate spindle dynamics (23). We have recently described a simple but justifiable model of microtubule polymerization in interphase cells and demonstrated that it can be used to make reasonable estimates of the number, length distribution, and degree of growth direction persistence of microtubules in HeLa cells (24).
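
The inverse modeling principle (a generator, a comparator, and a search over parameters) can be illustrated with a toy example. Everything here is an assumption made for illustration: the generator draws straight filaments radiating from the image center rather than simulating microtubule polymerization as in (24), the comparator uses two simple image features, and the search is an exhaustive grid.

```python
import itertools
import numpy as np

def synthesize_network(n_filaments, length, rng, size=64):
    """Toy generator: straight filaments radiating from the image center."""
    image = np.zeros((size, size))
    center = size // 2
    steps = np.arange(length)
    for angle in rng.uniform(0.0, 2.0 * np.pi, size=n_filaments):
        rows = np.rint(center + steps * np.sin(angle)).astype(int)
        cols = np.rint(center + steps * np.cos(angle)).astype(int)
        # np.add.at accumulates correctly where filaments overlap.
        np.add.at(image, (rows, cols), 1.0)
    return image

def image_features(image):
    """Comparator features: total intensity and intensity-weighted mean radius."""
    size = image.shape[0]
    rows, cols = np.mgrid[0:size, 0:size]
    radius = np.hypot(rows - size // 2, cols - size // 2)
    total = image.sum()
    return np.array([total, (image * radius).sum() / total])

def inverse_model(real_image, param_grid, rng):
    """Return the grid parameters whose synthetic image best matches the real one."""
    target = image_features(real_image)
    best_params, best_dist = None, np.inf
    for n_filaments, length in param_grid:
        synthetic = synthesize_network(n_filaments, length, rng)
        dist = np.linalg.norm(image_features(synthetic) - target)
        if dist < best_dist:
            best_params, best_dist = (n_filaments, length), dist
    return best_params, best_dist

rng = np.random.default_rng(2)
real_image = synthesize_network(10, 20, rng)  # stands in for a measured image
grid = list(itertools.product([5, 10, 20], [10, 20]))
best_params, best_dist = inverse_model(real_image, grid, rng)
```

On this toy grid the search recovers the generating parameters (10 filaments of length 20) even though individual filaments cannot be traced in the image. A realistic application would need a richer feature set and a generator with many more parameters.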

Combining component models: Independent or conditional

A major goal of the model building described here is to be able to create cell models containing spatially realistic distributions for many different proteins. Since the number of different proteins that can be measured in the same living cell is currently less than ten (although the number in fixed cells is at least one hundred (25)), it is difficult to imagine using multi-color images directly for this purpose. An alternative is to combine subcellular models learned from separate sets of images. This can be done by constructing a single nuclear and cell shape and then adding objects or networks in turn for each additional component. This assumes that these distributions are independent of each other. If this is not the case, the placement of one component can be made conditional on that of another. For example, endosomes can be preferentially placed along microtubules.
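
The difference between independent and conditional placement can be sketched as a change in the sampling weights for object positions. This is a hypothetical illustration: the function name, the `affinity` parameter, and the linear up-weighting by network intensity are assumptions, not a model learned from images.

```python
import numpy as np

def place_objects(cell_mask, n_objects, rng, network_image=None, affinity=5.0):
    """Sample object positions inside the cell, optionally biased toward a network."""
    # Independent case: every pixel inside the cell is equally likely.
    weights = cell_mask.astype(float)
    if network_image is not None:
        # Conditional case (assumption): up-weight pixels in proportion to
        # network intensity; network_image must have a positive maximum.
        weights = weights * (1.0 + affinity * network_image / network_image.max())
    probs = (weights / weights.sum()).ravel()
    flat = rng.choice(probs.size, size=n_objects, p=probs)
    # Convert flat indices back to (row, column) positions.
    return np.column_stack(np.unravel_index(flat, cell_mask.shape))

# Toy inputs (assumption): a circular cell and a single horizontal "filament".
rng = np.random.default_rng(3)
yy, xx = np.mgrid[0:64, 0:64]
cell_mask = (yy - 32) ** 2 + (xx - 32) ** 2 <= 28 ** 2
network_image = np.zeros((64, 64))
network_image[32, 10:54] = 1.0
independent = place_objects(cell_mask, 30, rng)
conditional = place_objects(cell_mask, 30, rng, network_image=network_image)
```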