|Home | About | Journals | Submit | Contact Us | Français|
Given the importance of subcellular location to protein function, computational simulations of cell behaviors will ultimately require the ability to model the distributions of proteins within organelles and other structures. Towards this end, statistical learning methods have previously been used to build models of sets of two-dimensional microscope images, where each set contains multiple images for a single subcellular location pattern. The model learned from each set of images not only represents the pattern but also captures the variation in that pattern from cell to cell. The models consist of sub-models for nuclear shape, cell shape, organelle size and shape and organelle distribution relative to nuclear and cell boundaries, and allow synthesis of images with the expectation that they are drawn from the same underlying statistical distribution as the images used to train them. Here we extend this generative models approach to three dimensions using a similar framework, permitting protein subcellular locations to be described more accurately. Models of different patterns can be combined to yield synthetic multi-channel image containing as many proteins as desired, something that is difficult to obtain by direct microscope imaging for more than a few proteins. In addition, the model parameters represent a more compact and interpretable way of communicating subcellular patterns than descriptive image features, and may be particularly effective for automated identification of changes in subcellular organization caused by perturbagens.
Proteins function in different cellular or subcellular compartments as part of complex systems. In systems biology, investigating and modeling these complex systems from different aspects and at various levels is hoped to lead to a mechanistic understanding of cell behavior [1, 2]. Extensive efforts have been made towards proteome-scale determination of protein sequence, structure, abundance and interactions and tremendous progress has been achieved. Much less information is available about protein location within cells, with descriptions using words (such as GO terms) being the main approach used to represent this important concept. More detailed and comprehensive approaches to learning and describing the spatial distributions of proteins at different levels of accuracy will be critical for systems models.
Development of modern microscopy technology makes observation of protein localization possible both in vitro and in vivo with high throughput. However, traditional visual analysis to recognize protein localizations can be a key barrier for converting large sets of images to useful descriptions of protein locations. To overcome this difficulty, machine learning methods and digital image processing tools have been combined to develop systems that automatically recognize protein subcellular location patterns . Here “pattern” designates the subcellular distribution of a protein, or of a set of proteins whose distributions are statistically indistinguishable. The most critical component of these systems is sets of numerical features to describe protein subcellular location patterns in 2D or 3D images. With these features, the feasibility is capable of classifying major protein subcellular location patterns with high accuracy and efficiency compared to visual analysis has been demonstrated [4, 5].
However, recognition of location patterns provides only limited information. For example, describing a protein location as “nucleus” in a given cell type under a given condition provides no detail on how it is distributed within the nucleus (and of course no information on the size or shape of nuclei in that cell type). Similarly, recognition based approaches can describe a protein’s “relocation from organelle A to B“ but communicates no information about how this process happens spatially and geometrically. Thus, beyond simply recognizing subcellular location patterns, an important goal is to be able to build models to capture the essence and variation of a specific pattern.
Zhao and Murphy  describe the first system for constructing generative models of subcellular patterns in 2D images, providing a framework in which cell structure and subcellular location patterns can be represented and communicated. In this work, images are viewed as the manifestation of a set of random variables and image synthesis or generation is viewed as a stepwise random process. A statistical generative model is the combination of all distributions of these random variables. Building a generative model for images in the form of a joint distribution of all pixels in an image is too computationally expensive (and potentially underdetermined) to be practical. Therefore, methods of computational geometry and data analysis were explored in order to compromise between complexity and accuracy of the model. 2D fluorescence images of cells were represented by three major components: nucleus, cell membrane, and protein objects distributed inside these compartments. All three components were represented by small sets of parameters (much fewer than the number of image pixels) from which the key features related to protein locations in the original image can be reconstructed with reasonable accuracy. The three components were modeled conditionally on each other, e.g., the output of the model of organelle position takes as inputs the instances drawn from models of cell and nuclear shape.
While this initial approach is useful for the vast majority of fluorescence microscope images that are acquired in only 2D, they represent a significant simplification of actual cell organization in 3D. Therefore, in this paper we describe extending the generative modeling and simulation framework to 3D.
For the studies describe here, we used the same 3D HeLa dataset  used previously for testing 2D models (the previous studies used only the central slide of each 3D stack). The dataset is available at http://murphylab.web.cmu.edu/data and contains three fluorescent channels for each field: a DNA channel that reflects nuclear shape and chromosome texture; a total protein channel that reflects cell shape; and an antibody channel that reflects the distribution of a particular protein. Each field has been previously segmented into single cell regions using a seeded watershed approach. The entire 3D stacks of 447 images were included to build the 3D nuclear and cell shape models. Protein location models were created for four “object-like” protein location patterns— lysosomes, mitochondria, endosomes and nucleoli, each with around 50 training 3D image stacks. Each stack contains 14 to 29 slices containing 1024 × 1024 voxels. The voxel spacing is 0.049µm × 0.049µm × 0.203µm.
The 3D generative model algorithm used in this work was built upon the toolbox for 2D generative models (available from http://murphylab.web.cmu.edu/software) and implemented using MATLAB (version 7.8.0). Code and trained models will be available upon publication from the same site. Methods and algorithms to create each component of the model are described in Results. Before the modeling step, each image slice was thresholded using the Ridler-Calvard method . This was done rather than using a global threshold because lower slices in the 3D stacks are subject to photobleaching effect. Nuclear and cell shapes were rotated to have the major axis aligned using a principal axis alignment method described in . The Matlab spline toolbox (version 3.3.6) was used for nuclear shape modeling. The EM code used in protein object modeling from the NETLAB library (http://www.ncrg.aston.ac.uk/netlab) was modified to estimate weighted Gaussian mixtures with interval inputs (instead of points).
We begin by building a 3D nuclear shape model. In the previous 2D model, a nuclear shape was represented by curves describing the medial axis and the width along it. These two curves were each parametrically approximated by fourth order B-spline curves. The medial axis representation has also been applied in modeling 3D shapes, such as pancreas . However, it is not easy to find a concise and proper representation for 3D nuclear shape with this method, especially for nuclei of cultured cells, which are usually at (Figure 1a).
We therefore consider the 3D shape of a nucleus to be described by a parametric surface [x(,z), y(, z), z(,z)]. The definition becomes much clearer in a cylindrical coordinate system
The cylindrical representation is an “unwrapping” of the side surface of a nuclear shape to a surface function r(, z) with rectangular support (Figure 1b). The conversion from Cartesian to cylindrical coordinate systems included resampling of the digitized image.
As previously done for 2D models, we used splines to parameterize the 3D nuclear shape as they were observed to give very good fits. A tensor product spline surface was used to fit the surface function r(, z).
where m and n are the number of control points for and z, respectively.
In the tensor form, Ni,p() and Nj,q(z) are B-spline basis functions of degree p and q, respectively; τi,j are the control points or coefficients of the basis functions. Parameterization of the surface function r(, z) is achieved by least-square approximation.
where N1 and N2 are the number of angles and slices in the digitized image, respectively. rB(i, zj) is the radius calculated from the fitted spline surface function at (i, zj), where real value r(i, zj) is observed. To achieve good fit with reasonable complexity, and to eliminate the effect of high frequency digitization artifacts, we pick the orders of the B-spline basis functions as p = 4 and q = 3 for the HeLa nuclei (Figure 1c), with m = 8 and n = 3. We observed the fitted values of the internal knot points to be around (1/4, 3/8, 1/2, 5/8, 3/4) along the azimuth coordinate and 1/2 along the height coordinate and to have little contribution to the variation of the spline surface. The positions were therefore set to be constants.
Fitting the “unwrapped” surface ignores the continuous condition at 0 = 0 and N1 = 2π.
Fujioka and Kano  have proved that continuity is maintained if and only if the coefficients τ satisfies
We append these constraints to the least-square approximation problem in equation 3 to guarantee periodicity. This reduces the number of free parameters from (m + 1) * (n + 1) = 36 to 32. Adding a parameter for the height, each nuclear shape is therefore represented by a total of 33 free parameters. All nuclei are rotated and mirrored (if necessary) according to the central slice so that they all have same “elongating” and “bending” direction. Statistical learning is then performed on the shape parameters extracted from these nuclei images to describe the variation of the shapes from nucleus to nucleus. We chose a multivariate normal distribution to model the variation of the parameters. As shown in Figure 2, comparison of the data empirical distribution with the fitted normal distribution validates this choice. The nearness of each plot to the diagonal line indicates close agreement to theoretical normal distributions. The mean and median of p-values under Kolmogorov-Smirnov normality tests for all parameters are 0.28 and 0.19, respectively, also confirming the choice. Thus we conclude that the nuclear shape of HeLa cells can be captured by a statistical model with 562 values: a length 32 vector τ for the means of the spline coefficients and the unique elements of the symmetric 32 × 32 covariance matrix, plus the mean and standard deviation of the height.
To synthesize a nuclear shape, we draw the parameter values from the trained distributions and then construct a new shape from the parameters.
The next step is to construct a model of the cell shape (or plasma membrane). Compared with the nuclear boundary, the cell boundary has more local morphological flexibility and much larger variation from cell to cell, making it hard to be parameterized to a statistically simple representation.
However, it is obvious that the cell shape is conditioned on the nuclear shape (at least the nucleus is inside the cell). Instead of building a parametric cell shape model and a correlation model separately, we adopt the approach previously used in the 2D model: learning the ratio of the radius of the cell to that of the nucleus. As for the nucleus, we first define a polygonal representation of the cell shape in a cylindrical coordinate system, denoted as rc(, z). The radius ratios are then expressed as a function of and z: R(, z) = r(θ, z)/rc(θ, z). We sampled over 360° in 1 degree increments, and sampled z over 18 slices. This was the average number of slices per cell; all cells were resampled to this value along the z dimension. The ratio representation contains 360 × 18 = 6120 values.
Direct statistical estimation is impossible for the 6120-dimensional vector with only hundreds of samples, and is not necessary to guarantee accuracy. Instead, we applied principal component analysis (PCA) to the ratio vectors after centering them by subtraction of the mean vector. Instead of performing PCA directly on the original covariance matrix, we use a modified PCA method to find significant modes and coefficients quickly by applying eigen analysis on , where R is an N-column matrix with each column a centered ratio vector. This modified PCA method is a more efficient approach for eigen decomposition with far more original data dimension than number of samples, based on discussions in both [11, 12].
With 20 most significant principal modes the shape ratio can be reconstructed substantially (Figure 3). Moreover, the coefficients λij of each mode appear to be normally distributed and independent from each other (Figure 4). The mean and median of p-values under Kolmogorov-Smirnov normality tests for all parameters were 0.29 and 0.22, respectively. The cell shape model then contains 21 6120-dimensional constant vectors for the mean and principal modes, and 20 variances of the coefficients (the coefficient means are zero).
To generate an instance of cell shape, we first synthesize an instance of nuclear shape as described in the previous section. We then sample values for each principal component from its normal distribution, generate centered ratio vectors by multiplying by the saved principal modes, and add back the mean ratio vector. The result has a fixed height of 18 slices. We next sample a cell height from its normal distribution, stretch the synthesized nuclear shape to correspond to this height, apply the ratios at every (θ, z) and form a cell shape by connecting line segments between neighboring surface points. The cell shape is stretched back so that the nuclear height corresponds to its original height. Note that this assumes that the nucleus and cell cover the same number of slices; this was observed to be the case for HeLa cells which have little cytoplasmic space above or below the nucleus (relative to the slice thickness).
Protein subcellular location patterns have been successfully represented using “objects” in a number of previous studies, such as location pattern recognition and complex pattern unmixing [4, 13, 14, 15]. “Objects” are defined as contiguous regions of non-zero pixels. Many cell organelles have roughly ellipsoidal shapes and appear to be ellipsoids with intensity clustered in the center in the images. Zhao and Murphy  therefore used 2D Gaussian distribution functions to fit objects in 2D images and modeled each ellipsoidal family as the distribution of the Gaussian distribution parameters. In this paper, we also focus on modeling protein location patterns consisting of mostly ellipsoidal objects in 3D images by extending the methods of Gaussian objects learning into 3D.
As in 2D images, ellipsoidal objects in 3D images may aggregate to form larger objects to which single Gaussian gives a poor fit. Using the expectation-maximization (EM) algorithm, we can estimate parameters of each Gaussian component in an aggregated object as a Gaussian mixture. The probability density function (PDF) of a Gaussian mixture distribution can be written as
g(x|μk, Σk) is Gaussian PDF over three variables x = (x, y, z) with mean μk and covariance matrix Σk. pk is probabilistic weight of the kth Gaussian component in the mixture ().
As in the 2D case, we weight each data point with the intensity value of the voxel (wi = I(xi, yi, zi)). The E-step of the algorithm to fit the Gaussian mixture stays the same for 3D objects. αik is the expected probability that point xi is associated with component k. The superior (t) added to the mixture model PDF parameters means the values at step t of the iterative EM process. The Gaussian parameter for each object is fitted at convergence.
In the previous 2D model, pixels were treated as points. However, taking account of the digitization effect, the pixels and voxels are better treated as intervals. Therefore, the Gaussian parameters estimation in the M-step should be slightly changed. The variances should increase by 1/12 to avoid zero variance at certain dimensions (1/12 is the variance of a standard continuous uniform distribution).
We then perform statistical learning, mostly distribution fitting with maximum likelihood estimation, on the decomposed Gaussian function parameters. The centers (mean) of the Gaussian objects are used to learn the object position model, which is described in the next section. The covariance matrix Σ is directly related to the size of the object and we use it to train an object size model. The ellipsoid Gaussian objects are rotated to have their major and minor axis aligned to the Cartesian coordinates (the rotation transformation can be achieved by eigen decomposition of Σ). Each object is then represented in size by three standard deviations σX, σY and σZ in descending order. As the variances along different coordinates are highly dependent, we use a simple Bayesian structure to fit their distribution: σY ← σX → σZ. Figure 5a shows standard deviation on the major axis is well fit by exponential distribution. Conditional distribution of σY or σZ over σX are also fit by normal distributions (normality of the conditional distributions of P(σY |σX) or P(σZ|σX) is shown in Figure 6). The mean and median of p-values under Kolmogorov-Smirnov normality tests for all parameters are both 0.37. Figure 5b,c,d,e show the dependency of the normal parameters on σX, along with their parametric fitting by
The other important parameter to describe a single Gaussian object is its intensity. It is defined to be the coefficient between fitted object and the Gaussian function, which is normalized. γi for an object is estimated from the product of total intensity of the aggregated object it resides in and the fraction this object in the whole mixture Ii(x) = γi · g(xi). Again, we choose exponential distribution to fit the intensity coefficient γi (Figure 7).
To synthesize an individual Gaussian object, we simply generate the σ, γ values and use them to synthesize a 3D Gaussian function γ · g(x|0, Σ). The Gaussian function is digitized and added to the 3D image at a specific position as determined below.
We modeled the positions of protein objects relative to the nuclear and plasma membranes. The model we used here is a direct extension of the 2D protein object position model used previously.
We parameterize the position of each object by three variables: s, ratio of the distance of a given object’s center to the nuclear surface over the sum of that distance and the distance to the cell surface; and θ and , the inclination angle and azimuth angle of the vector from the nuclear center to the object center. Using this parameterization, a distribution was formed in which all points where an object occurs were set to 1 and all other positions set to zero. After normalizing for the total number of objects, the potential (the probability that a given position is the center of an object) can be fitted by a logistic regression function
Here the parameters of β have different interpretations for object location “preference”. For example, β1 shows whether the protein is more likely to distribute near the nucleus or near the cell membrane. β3, β4 and β5 determines the protein angular preference. Table 1 shows learned β for different location patterns. As expected, nucleoli show a preference to be inside the nuclear membrane (negative β1).
At this point, the only aspects of the model that are not statistically modeled are the number of Gaussian objects and the direction of alignment of each of them to the central axis. These parameters did not show an easily fitted distribution (data not shown) and therefore for image generation values for them are randomly sampled from the empirical distributions. Given synthesized nuclear and cell shapes, Gaussian objects and positions, we can synthesize full images depicting protein subcellular location patterns. Figure 8 shows synthesized location images with three channels of the four patterns. Additional images can be generated using the tools and models available at http://murphylab.web.cmu.edu/software.
The work described here enables significantly more accurate and realistic descriptions and simulations than prior work. Nonetheless, much further work is required to further refine and extend these models. In particular, methods are needed to capture patterns not well represented by discrete objects. Work on learning generative models for microtubule distributions represents an initial step in this direction .
As in any parametric model, our approach includes a number of choices for what parameters to use and how to capture their distributions (mostly using normal and exponential distributions). Their generalizability to new cell types is therefore unknown. For such applications in the future, it will therefore be important to revisit these choices, and perhaps ultimately to make them using large collections of images for many cell types. In cases were parametric models do not perform well, alternatives such as the instance-based diffeomorphic shape generation models described previously  may be considered.
With our approach, images can be generated that appear to the eye to capture the essential features of protein location patterns. These synthesized images are suitable for simulations (using systems like Virtual Cell  and MCell ), and their utility can ultimately be evaluated (and hopefully increased) through simulations using them to predict cell behaviors. It is important to note that in line with their anticipated use in simulations, our models are abstractions not intended to generate images that appear similar to actual microscope images. However, it is a straightforward matter to generate such images by convolution of the idealized image with the point-spread function for a particular microscope .
While our 2D and 3D systems are the only ones described to date that construct generative models of organelles within the context of cell and nuclear organization, other investigators have described approaches for modeling or simulating nuclear or organelle size and shape [20, 21]. Models resulting from such work can readily be incorporated into the conditional model framework we have described.
Grant sponsor: National Institutes of Health; Grant number: R01 GM075205