Commonly employed data models for maximum likelihood refinement of electron microscopy images behave poorly in the presence of normalization errors. Small variations in background mean or signal brightness are relatively common in cryo-electron microscopy data, and varying signal-to-noise ratios or artifacts in the images interfere with standard normalization procedures. In this paper, a statistical data model that accounts for normalization errors is presented, and a corresponding algorithm for maximum likelihood classification of structurally heterogeneous projection data is derived. The extended data model has general relevance, since similar algorithms may be derived for other maximum likelihood approaches in the field. The potential of this approach is illustrated for two structurally heterogeneous data sets: 70S E. coli ribosomes and human RNA polymerase II complexes. In both cases, maximum likelihood classification based on the conventional data model failed, whereas the new approach was capable of revealing previously unobserved conformations.
Over the last decades, three-dimensional electron microscopy (3D-EM) has developed into a widely applicable technique for the structural characterization of biological complexes. On one hand, ever increasing resolutions are obtained for well-behaved (conformationally stable) macromolecular complexes, currently reaching up to 3.8 Å for icosahedral virus reconstructions (Zhang et al., 2008; Yu et al., 2008) and up to 5.4 Å for particles with low or no symmetry (e.g. see Stagg et al., 2008). On the other hand, 3D-EM techniques are being applied to ever more complicated samples. The structural characterization of highly flexible cellular machines is nowadays feasible through the single particle reconstruction approach of purified samples (Stark and Lührmann, 2006; Grob et al., 2006; Nickell et al., 2007), while the characterization of the molecular atlas of whole cells is within reach of modern cryo-electron tomography (Nickell et al., 2006; Robinson et al., 2007). Together with numerous instrumental improvements, these advances have gone hand-in-hand with important developments in image processing techniques in the field.
With the image processing tasks becoming ever more complicated, there is a growing interest in the use of statistical methods in 3D-EM and in particular in maximum likelihood approaches. Perhaps the most important characteristic of the maximum likelihood approach is the natural way in which the noisy character of the experimental data may be modelled. This is especially relevant in the case of cryo-EM data, where a limited electron dose to prevent radiation damage results in extremely low signal-to-noise ratios. The maximum likelihood approach has now been applied to a range of different image processing tasks, such as 2D alignment (Sigworth, 1998), 2D classification (Pascual-Montano et al., 2001), 3D reconstruction of icosahedral viruses (Vogel and Provencher, 1988; Doerschuk and Johnson, 2000; Yin et al., 2003; Lee et al., 2007), alignment of 2D crystal images (Zeng et al., 2007), and 3D classification of heterogeneous projection data (Scheres et al., 2007a).
All these approaches employ the same statistical data model that assumes white Gaussian noise in the data. Recently, we also introduced an alternative model for coloured noise (Scheres et al., 2007b), but the assumption of Gaussian noise remains a common factor for all maximum likelihood approaches in the field. Although the explicit description of the experimental noise in the maximum likelihood approach offers general advantages over conventional approaches, it may also present important limitations in specific cases. The distance metric that underlies the Gaussian model is based on the squared Euclidean distance between an experimental image and its template. In contrast to the conventional (normalized) cross-correlation coefficient, the Euclidean distance metric is highly sensitive to differences in image background and signal brightness. This means that any maximum likelihood approach based on this metric may suffer from variations in background mean or signal brightness among the data. In the case of image classification, for example, the data may be separated into subsets with similar image backgrounds or signal brightness rather than into structurally homogeneous subsets.
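This contrast can be made concrete with a small numerical sketch (Python, using a random image as a stand-in for a template, purely for illustration): applying a brightness scale and a background offset to an image inflates its squared Euclidean distance to the template, while leaving the normalized cross-correlation coefficient untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def ncc(a, b):
    """Normalized cross-correlation coefficient between two images."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

template = rng.normal(size=(32, 32))                       # stand-in template
image = template + 0.1 * rng.normal(size=template.shape)   # noisy copy
shifted = 1.5 * image + 0.3   # brightness error (x1.5) + background error (+0.3)

d_orig = np.sum((image - template) ** 2)
d_shift = np.sum((shifted - template) ** 2)

# The squared Euclidean metric is strongly affected by the normalization
# error, whereas the correlation coefficient is invariant to it.
print(d_orig, d_shift)
print(ncc(image, template), ncc(shifted, template))
```

The invariance holds because the correlation coefficient standardizes both images before comparison, absorbing any positive scale and constant offset.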
In practice, one aims to minimize the variations in background mean and signal brightness by normalizing the data. Because the abundant noise in 3D-EM data makes it difficult to normalize the signal itself, it is common practice to normalize the noise instead. Typically, one subtracts a least-squares plane to obtain zero-mean backgrounds and subsequently divides by the standard deviation to obtain similar noise intensities among all images. To account for the fact that different orientations of an asymmetrical particle may yield projections with different signal powers, one often calculates this plane and standard deviation over an area of the image that presumably contains only noise (Sorzano et al., 2004a). However, the presence of neighbouring particles in this so-called background area, or variations in the signal-to-noise ratios (e.g. due to different ice thickness or defocus values), may lead to remaining variations in the signal intensity among the normalized images. Consequently, to some extent 3D-EM data sets always display non-zero background means and variations in signal brightness. Nor is the relatively common practice of high-pass filtering the particles a remedy for this problem, as the additive and multiplicative variations in the underlying signal do not necessarily relate to the power of the entire image.
Therefore, we propose an extension of the commonly used data model of white Gaussian noise that allows variations in background mean and signal brightness among the images to be described. This model is generally applicable to any of the existing maximum likelihood approaches in the field, although here we will focus on the problem of 3D classification. This problem plays a crucial role in the single particle analysis of flexible macromolecular complexes that adopt multiple conformations and may vary in subunit composition or ligand occupancy. The way these complexes work may often be inferred from their distinct structural states, but the difficulties in biochemically purifying these states make them cumbersome to study. Cryo-EM allows recording projections of individual particles that are free to adopt any of their functional states. Thereby, one may obtain structural information about a whole range of conformations from a single cryo-EM experiment, provided that one can sort the data into subsets of projections from particles with identical 3D structures. However, this process of in silico purification currently still represents one of the major challenges in 3D-EM single-particle analysis (Leschziner and Nogales, 2007).
Based on the proposed statistical model for data with normalization errors, we have derived a maximum-likelihood algorithm for the 3D classification of structurally heterogeneous projection data. We call this algorithm MLn3D classification, to distinguish it from the previously introduced ML3D algorithm that is based on the commonly employed data model without normalization errors (Scheres et al., 2007a). Here we illustrate the usefulness of the new algorithm for two highly challenging cryo-electron microscopy data sets: a 70S E. coli ribosome data set and a data set on human RNA polymerase II in complex with human Alu RNA (Mariner et al., 2008). For both data sets, we show how the conventional maximum likelihood approach failed due to normalization errors in the data, whereas the MLn3D algorithm was capable of separating distinct, previously unobserved structural states.
We model 2D images X1, X2, …, XN as follows:

Xi = si P^Φi Vκi + ci + Gi,   (1)

where Vκi is one of K underlying 3D structures V1, …, VK; P^Φi denotes 2D projection at orientation and in-plane position Φi; si and ci are the (unknown) signal brightness and background mean of the ith image; and Gi is additive white Gaussian noise with standard deviation σ.
To make effective use of the data model in (1), we estimate its parameters by way of maximum likelihood estimation. We view the estimation problem as a missing data problem, where the missing data associated with Xi are the position Φi and the random index κi. The associated normalization parameters si and ci are viewed as unknown parameters, not as missing data. So, the complete data set is

{(Xi, Φi, κi)}, i = 1, …, N,

a random sample of (X, Φ, κ).
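As a minimal generative sketch of this data model (hypothetical Python: random 2D templates stand in for the class projections, and the orientation Φi is omitted for brevity), each image receives its own signal brightness si and background mean ci on top of white Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(1)

K, N, shape = 2, 100, (16, 16)
templates = [rng.normal(size=shape) for _ in range(K)]  # stand-ins for projections
pi = np.array([0.7, 0.3])   # mixing proportions over the K classes
sigma = 0.5                 # standard deviation of the white Gaussian noise

images, s_true, c_true, k_true = [], [], [], []
for i in range(N):
    k = int(rng.choice(K, p=pi))   # hidden class index (kappa_i)
    s = rng.normal(1.0, 0.2)       # per-image signal brightness (s_i)
    c = rng.normal(0.0, 0.1)       # per-image background mean (c_i)
    X = s * templates[k] + c + sigma * rng.normal(size=shape)
    images.append(X)
    s_true.append(s)
    c_true.append(c)
    k_true.append(k)
```

In the real problem the class index, orientation, and normalization parameters are all unknown and must be recovered from the images alone.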
Note that this random variable has one discrete component, to wit κ, and two continuous components. The joint distribution may then be written as

f(X, Φ, κ = k) = πk f(Φ|k) f(X|Φ, k),
thus defining the probability vector π = (π1, …, πK), which represents the unknown distribution of the data among the different classes. The distribution of the orientations and in-plane positions of the images is modelled by f(Φ|k). This distribution involves the assumption that particle picking has yielded roughly centred particles with residual offsets according to a two-dimensional Gaussian, centred at the origin. The corresponding formulae have been described in detail previously (Scheres et al., 2007a) and will not be repeated here. According to the noise model in (1), we calculate f(Xi|Φ, k) as follows:

f(Xi|Φ, k) = (2πσ²)^(−J/2) exp( −‖Xi − si P^Φ Vk − ci‖² / (2σ²) ),

where J denotes the number of pixels in an image.
The marginal pdf of Xi is then a mixture,

f(Xi) = Σk πk ∫ f(Φ|k) f(Xi|Φ, k) dΦ,
and the maximum likelihood estimation problem is to find those parameters Θ* that maximize the logarithm of the joint probability of observing the entire set of images X1, X2, …, XN:

Θ* = argmaxΘ Σi log f(Xi|Θ).
Note that, apart from the parameters describing f(Φ|k), the unknown parameter set Θ contains σ, π, s, c and V, with

π = (π1, …, πK), s = (s1, …, sN), c = (c1, …, cN), and V = (V1, …, VK).
The log-likelihood target function may be optimized using expectation maximization (Dempster et al., 1977). In the E-step of this iterative algorithm, a lower bound Q(Θ; Θold) to the log-likelihood is built based on the current model parameter set Θold:

Q(Θ; Θold) = Σi Σk ∫ f(k, Φ|Xi, Θold) log[ πk f(Φ|k) f(Xi|Φ, k) ] dΦ,   (8)
where f(k, Φ|Xi, Θold) is the probability distribution of the hidden variables conditioned on the observed measurements. This distribution may be calculated as:

f(k, Φ|Xi, Θold) = πk f(Φ|k) f(Xi|Φ, k) / [ Σk' ∫ πk' f(Φ'|k') f(Xi|Φ', k') dΦ' ],

with all terms on the right-hand side evaluated at Θold.
In the subsequent M-step of the algorithm, we optimize the lower bound with respect to all model parameters.
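A hedged sketch of these two steps in Python (hypothetical helper functions, not the Xmipp implementation; the integration over orientations Φ is omitted, so only the class index is marginalized, and log-probabilities are shifted before exponentiation for numerical stability):

```python
import numpy as np

def e_step(images, templates, s, c, sigma, pi):
    """Posterior responsibilities gamma[i, k] of class k for image i."""
    N, K = len(images), len(templates)
    log_p = np.empty((N, K))
    for i, X in enumerate(images):
        for k, T in enumerate(templates):
            resid = X - s[i] * T - c[i]
            # log of pi_k * f(X_i | k), up to a constant shared by all k
            log_p[i, k] = np.log(pi[k]) - np.sum(resid ** 2) / (2 * sigma ** 2)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilize before exponentiation
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)   # Bayes' rule: normalize over k
    return gamma

# toy usage: five images generated from the first of two templates
rng = np.random.default_rng(2)
T = [rng.normal(size=(8, 8)) for _ in range(2)]
X = [T[0] + 0.3 * rng.normal(size=(8, 8)) for _ in range(5)]
gamma = e_step(X, T, s=[1.0] * 5, c=[0.0] * 5, sigma=0.3, pi=[0.5, 0.5])
pi_new = gamma.mean(axis=0)   # M-step update of the mixing proportions
```

The last line is the M-step update for the mixing proportions: each new proportion is simply the average responsibility of that class over all images.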
The updates of the mixing proportions πnew may be calculated independently from the updates of the other model parameters:

πknew = (1/N) Σi ∫ f(k, Φ|Xi, Θold) dΦ.
The update of the other model parameters is more complicated. Because the log f(Xi|Φ, k) term in (8) depends on σ, s, c and V, a strict expectation-maximization algorithm would have to optimize the lower bound for all these parameters simultaneously. Instead of trying to solve the corresponding non-linear system of equations, we implemented the alternative that is outlined below. We note that a similar approach has previously been taken in other maximum likelihood approaches in the field, where the lower bound depends simultaneously on the model parameters for the signal and the standard deviation of the noise (e.g. see Sigworth, 1998; Zeng et al., 2007). For reasons of computational speed, one typically updates these two parameters separately, but in a strict sense this is not guaranteed to optimize the log-likelihood function. In practice, however, these subtle differences do not seem to interfere significantly with the convergence behaviour of existing algorithms, and in our experience thus far the MLn3D algorithm is no exception.
First, we set the partial derivatives of (8) with respect to si and ci to zero and solve for these variables respectively, yielding:

sinew = [ Σk ∫ f(k, Φ|Xi, Θold) (Xi − ci)·(P^Φ Vk) dΦ ] / [ Σk ∫ f(k, Φ|Xi, Θold) (P^Φ Vk)·(P^Φ Vk) dΦ ]

cinew = (1/J) Σj [ (Xi)j − si Σk ∫ f(k, Φ|Xi, Θold) (P^Φ Vk)j dΦ ]
where (a) · (b) denotes the dot product between a and b, and (a)j denotes the jth pixel of a.
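This step can be sketched as a small linear solve per image (hypothetical Python; a fixed set of templates replaces the projections, and the orientation integral is again omitted). Since the two normal equations couple si and ci through the dot products above, we solve them jointly as a 2×2 system:

```python
import numpy as np

def update_s_c(X, templates, gamma_i):
    """Least-squares estimates of s_i and c_i for one image X.

    Minimizes sum_k gamma_ik * ||X - s*T_k - c||^2 over s and c.
    """
    J = X.size
    a = sum(g * np.vdot(T, T) for g, T in zip(gamma_i, templates))  # sum g*(T.T)
    b = sum(g * T.sum() for g, T in zip(gamma_i, templates))        # sum g*(1.T)
    d = sum(g * np.vdot(T, X) for g, T in zip(gamma_i, templates))  # sum g*(T.X)
    # normal equations:  [[a, b], [b, J]] @ [s, c]^T = [d, sum(X)]^T
    s, c = np.linalg.solve([[a, b], [b, J]], [d, X.sum()])
    return float(s), float(c)

# noise-free check: an image with brightness 2.0 and background 0.5
rng = np.random.default_rng(3)
T = rng.normal(size=(8, 8))
s_hat, c_hat = update_s_c(2.0 * T + 0.5, [T], gamma_i=[1.0])
```

In the noise-free case the solve recovers the true brightness and background exactly, since the image lies in the span of the template and the constant image.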
Secondly, we use the updated si and ci in the updates of σ and V. Again setting partial derivatives to zero and solving for σ yields:

(σnew)² = (1/(NJ)) Σi Σk ∫ f(k, Φ|Xi, Θold) ‖Xi − si P^Φ Vk − ci‖² dΦ,
and the updated V may be obtained by solving the following K least-squares problems separately:

Vknew = argminVk Σi ∫ f(k, Φ|Xi, Θold) ‖Xi − si P^Φ Vk − ci‖² dΦ,   (14)
to which purpose we use a modified ART algorithm as presented previously (Scheres et al., 2007a).
Finally, since the overall brightness of Vk is directly correlated with the values of si, for each reference k we constrain the average image brightness to one (i.e. the probability-weighted average of si over the images assigned to reference k is constrained to equal one, and Vk is rescaled accordingly).
The algorithm exposed above was implemented in the open-source package XMIPP (Sorzano et al., 2004b), and may be accessed conveniently as an expert option in the recently implemented standardized protocols (Scheres et al., 2008). Because the integration over Φ (which in practice is replaced by a Riemann sum over a discrete grid) is extremely computation-intensive, we also implemented a reduced-space approach as presented previously (Scheres et al., 2005). Furthermore, besides the proposed algorithm for 3D classification, we implemented a related 2D classification algorithm. In that case, instead of optimizing (8) with respect to 3D-structures V1, …, VK, one optimizes this function with respect to 2D images A1, …, AK. The algorithm remains basically the same, except for the fact that in this case R represents an in-plane transformation (parametrized by a single rotation angle and two in-plane coordinates), and the least-squares problems in (14) are replaced by the following update formula:

Aknew = [ Σi ∫ f(k, Φ|Xi, Θold) si (RΦ⁻¹Xi − ci) dΦ ] / [ Σi ∫ f(k, Φ|Xi, Θold) si² dΦ ],

where RΦ⁻¹Xi denotes image Xi transformed back into the reference frame.
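In the 2D case the reference update thus reduces to a probability-weighted average of brightness- and background-corrected images; a sketch in Python (hypothetical helper, with the in-plane alignment of each image omitted for brevity):

```python
import numpy as np

def update_references(images, gamma, s, c, K):
    """Update 2D references A_k as probability-weighted averages.

    A_k = sum_i gamma[i,k] * s_i * (X_i - c_i) / sum_i gamma[i,k] * s_i**2
    """
    refs = []
    for k in range(K):
        num = np.zeros(images[0].shape)
        den = 0.0
        for i, X in enumerate(images):
            num += gamma[i, k] * s[i] * (X - c[i])
            den += gamma[i, k] * s[i] ** 2
        refs.append(num / den)
    return refs

# noise-free check: one image generated from T with s=2.0, c=0.5 recovers T
rng = np.random.default_rng(4)
T = rng.normal(size=(8, 8))
refs = update_references([2.0 * T + 0.5], np.array([[1.0]]), s=[2.0], c=[0.5], K=1)
```

Weighting by si (rather than dividing each image by its brightness) down-weights dim images, which carry proportionally more noise.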
To illustrate the usefulness of the MLn3D algorithm, we first show the results obtained with conventional ML3D classification (Scheres et al., 2007a) on a 70S ribosome complex from E. coli programmed with mRNA and containing deacylated tRNAfMet in the P site and fMetLeu-tRNALeu in the A site. An initial data set of 69,262 individual particles was used to calculate a cryo-EM density map to 13 Å resolution. The overall configuration revealed an unratcheted ribosome with strong tRNA density in the P site, but scattered density in the A and E sites, not accounting for full tRNAs (not shown). The poor representation of the A and E-site tRNAs in the map suggested a low occupancy of these sites and/or the presence of a mixture of different positions. The latter possibility prompted us to perform an unsupervised ML3D classification of these data. However, no apparent conformational differences could be observed among the resulting maps when using four classes (Figure 1a). Instead, pairwise difference maps consisted of positive or negative density throughout the ribosome particle.
Starting from the same four seeds, the MLn3D algorithm yielded maps representing ribosomes in distinct structural states (Figure 1b). Three of the classes (together accounting for approximately 80% of the particles) showed the ribosome in an unratcheted state, while the fourth class (the remaining 20% of the particles) revealed a ratcheted ribosome. Separate refinements of the two classes to higher resolution revealed that the tRNAs in the unratcheted ribosome are positioned at the classical A and P sites, while the ratcheted ribosome showed a previously unobserved conformation with tRNAs in the hybrid A/P and P/E sites (see (Julián et al., 2008) for details).
The MLn3D-refined values for the background mean and signal brightness of every experimental particle were then used to analyse a posteriori why the ML3D run had failed. Histograms of these values for all images assigned to each of the four classes indicated that the conventional algorithm had indeed separated the images based on background mean as well as on image brightness, while, as expected, no such separation could be detected for the MLn3D algorithm (Figure 2a-d). A similar observation could also be made without the MLn3D-refined values of the normalization parameters. Radial average density profiles of all unaligned experimental images assigned to each of the four ML3D classes already hinted at a separation based on differences in signal brightness and/or background mean (Figure 2e-f). Finally, a visual inspection of images with relatively high or low refined values for the signal brightness or background mean suggested that neighbouring particles may be related to variations in background mean as well as image brightness, while differences in ice thickness or defocus values mainly affect image brightness (results not shown).
The second test case concerns human RNA polymerase II in complex with the inhibitory human Alu RNA (Mariner et al., 2008). Application of the conventional ML3D algorithm with two classes yielded the maps that are depicted in Figure 3a. In this case, some putative conformational variability could be discerned between the resulting maps, but the absolute differences were relatively small. Much larger differences were obtained with the MLn3D algorithm (Figure 3b), which was again started from the same seeds as used for the conventional ML3D classification. In this case, the difference map showed specific regions of strong positive and negative density, which are indicative of a separation of the data according to conformational variability. The largest differences are located at the clamp of RNA polymerase II and around its DNA/RNA hybrid binding site. Smaller differences can be seen in the stalk domain and between the clamp and the stalk. These differences in conformation could be relevant to the different binding and inhibiting properties of the Alu RNA. Further interpretation of the functional significance of the different hRNAPII/Alu RNA conformers will be presented elsewhere.
In this case, the posterior analysis of the refined normalization parameters showed that the conventional ML3D algorithm had, at least partially, separated the data based on differences in background mean alone rather than also on signal brightness (Figure 4). Again, no signs of separation based on normalization errors could be detected for the MLn3D algorithm. Furthermore, radial average density profiles of the two ML3D-classes showed a marked discontinuity at the radius used for the background circle in the normalization protocol (see Methods), directly linking the ML3D classification results with the normalization of the individual images. Also in this case, a visual inspection of the images with relatively high or low refined values for the background mean indicated that this variation may be related to the presence of neighbouring particles (not shown).
The key to the advantage of maximum likelihood approaches over conventional refinement techniques lies in a more adequate statistical data model for 3D-EM images. In an intuitive manner, the explicit description of the abundant experimental noise allows one to discern between situations where one is confident about the assignment of missing data items (e.g. the unknown orientation of a particle with respect to its template) and situations where, based on the current model, such confidence is not justified. Instead of taking “hard” decisions in the form of discrete assignments, in the maximum likelihood approach one calculates probabilities for all possible assignments, and the model parameters are obtained as probability-weighted averages over all possibilities. However, if the statistical model does not describe the experimental data adequately, incorrect probability distributions will lead to suboptimal behaviour of the refinement approach. Therefore, a careful consideration of the underlying data model is of crucial importance for the potential of the statistical approach.
As mentioned in the introduction, the squared distance metric that underlies all currently employed maximum likelihood approaches in the field may be seriously affected by variations in background mean or signal brightness among the data. Such variations may be relatively common in cryo-EM data, where abundant levels of noise complicate the process of image normalization. In particular, differences in ice thickness or defocus value yield different signal-to-noise ratios in the particles, which upon normalization of the noise result in variations in the signal brightness. In addition, the presence of neighbouring particles or other artefacts in those areas used to estimate the power of the noise may affect both the background mean and the image brightness. The presence of normalization errors presents a handicap for the maximum likelihood approach compared to refinement techniques based on cross-correlation coefficients. In the latter, the normalized cross-correlation coefficient is invariant to the background mean and signal brightness. Therefore, although these variations in theory still result in ill-posed 3D reconstructions, in practice their effects on conventional refinement may often be ignored. Unfortunately, this is not the case for maximum likelihood refinements, as is illustrated by the results presented in this paper. For two structurally heterogeneous cryo-EM data sets we showed that normalization errors may affect ML3D classification to such an extent that they prevent the separation of the data into structurally homogeneous subsets.
This was our main motivation to propose an extended data model that accounts for normalization errors and to derive a corresponding expectation-maximization (-like) algorithm for the maximum likelihood classification of structurally heterogeneous projection data. The successful classification of the two cases shown indicates that the extended data model and the proposed algorithm may be useful assets to the field. Given this example, it should be relatively easy to derive similar algorithms for other maximum likelihood approaches in the field, like the 3D reconstruction of icosahedral viruses (Yin et al., 2003) or the alignment of 2D crystal images (Zeng et al., 2007). In addition, these principles could also be useful for maximum likelihood approaches that are yet to be proposed, for example for sub-tomogram averaging (Förster et al., 2008).
In conclusion, we foresee that the growing importance of statistical approaches in 3D-EM image processing will be accompanied by an increasing interest in their underlying data models. Experimental data may contain many more surprises that make our currently employed data models suboptimal. In that context, we hope that this paper may contribute to a continuing, community-wide discussion on better statistical models for 3D-EM image formation.
Ribosome samples were prepared as described in (Julián et al., 2008) and diluted to a final concentration of 32 nM. Cryo-EM grids were prepared following standard procedures, and micrographs were taken under low-dose conditions on a JEM-2200FS electron microscope. Images were recorded on a 4k×4k CCD camera at a magnification of 67,368×, resulting in a 2.2 Å pixel size. Semi-automated particle picking with the SPIDER package (Frank et al., 1996) yielded 69,262 boxed particles of 160 × 160 pixels.
Human RNA polymerase II (hRNAPII) was immunopurified from HeLa cell nuclei as previously described (Kostek et al., 2006). Alu RNA was provided by James Goodrich's laboratory (Mariner et al., 2008). hRNAPII was diluted to a final concentration of 60 nM and incubated with 120 nM Alu RNA. Cryo-EM grids were prepared according to standard procedures. EM data were collected on film (Kodak SO163) in a Tecnai 20F microscope (FEI) operated at 200 kV and 50,000× magnification, under low-dose conditions. Micrographs were digitized with a Nikon Super Coolscan 8000 with a 12.71 μm raster size, resulting in a pixel size of 2.54 Å. The boxer software from EMAN (Ludtke et al., 1999) was used to semi-automatically pick 31,219 particle images of 120 × 120 pixels.
All subsequent image processing operations were performed in the Xmipp package (Sorzano et al., 2004b). To reduce the computational costs of the maximum likelihood refinements, all data were downscaled using B-spline interpolation. The ribosome data were scaled to images of 64 × 64 pixels with a final pixel size of 5.5 Å/pixel; the RNA polymerase II data were scaled to 60 × 60 pixels with a final pixel size of 5.08 Å/pixel. All downscaled images were normalized using the following protocol for every image: (i) a background area was defined as those pixels outside a central, circular area of the image with a user-defined radius; (ii) a least-squares plane was fitted through the pixels in the background area and subtracted from the entire image; and (iii) the resulting image was divided by the remaining standard deviation of the pixels in the background area. The radius of the background area circle was set to 30 pixels for the ribosome data and to 28 pixels for the hRNAPII data.
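The three normalization steps can be sketched as follows (Python; `normalize_particle` is a hypothetical helper written for illustration, not the Xmipp implementation):

```python
import numpy as np

def normalize_particle(img, radius):
    """Normalize one boxed particle image.

    (i)   define the background as all pixels outside a central circle;
    (ii)  fit and subtract a least-squares plane through the background;
    (iii) divide by the standard deviation of the background pixels.
    """
    n = img.shape[0]
    y, x = np.mgrid[0:n, 0:n]
    center = (n - 1) / 2.0
    bg = (x - center) ** 2 + (y - center) ** 2 > radius ** 2   # (i)

    # least-squares plane z = a*x + b*y + d through the background pixels
    A = np.column_stack([x[bg], y[bg], np.ones(bg.sum())])
    coef, *_ = np.linalg.lstsq(A, img[bg], rcond=None)
    plane = coef[0] * x + coef[1] * y + coef[2]

    out = img - plane                 # (ii): zero-mean background
    return out / out[bg].std()        # (iii): unit background noise

# usage on a synthetic 64x64 particle box with a tilted, offset background
rng = np.random.default_rng(5)
yy, xx = np.mgrid[0:64, 0:64]
img = 0.02 * xx - 0.01 * yy + 1.5 + rng.normal(size=(64, 64))
out = normalize_particle(img, radius=30)
```

Because the fitted plane includes a constant term, the residual background has exactly zero mean, and the final division leaves it with unit standard deviation.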
ML3D classifications were performed as described previously (Scheres et al., 2007a). For the seed generation, the initial average 3D reconstruction of all ribosome particles was low-pass filtered to 80 Å; the initial map for the hRNAPII data was filtered to 75 Å. The MLn3D runs were started from the same seeds as the conventional ML3D classifications, and all multi-reference refinements were stopped after twenty iterations.
We thank the Barcelona and the Galicia Supercomputing Centers (BSC-CNS and CESGA) for providing computer resources, James Goodrich for providing the human Alu RNA and Cameron L. Noland for his contribution to data collection in the hRNAPII study. Funding was provided by the Spanish Ministry of Science (CSD2006-00023, BIO2007-67150-C03-1/3) and Comunidad de Madrid (S-GEN-0166-2006), the European Union (FP6-502828), the US National Heart, Lung and Blood Institute and the National Institutes of Health (R01 HL070472, R01 GM63072). E.N. is a Howard Hughes Medical Institute investigator. The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung and Blood Institute or the National Institutes of Health.