Article sections

- Abstract
- 1. INTRODUCTION
- 2. BOOTSTRAPPING IN THE SINGLE PARTICLE METHOD
- 3. RESULTS
- 4. DISCUSSION AND CONCLUSIONS
- 6. REFERENCES

Proc IEEE Int Symp Biomed Imaging. Author manuscript; available in PMC 2010 August 20.

Published in final edited form as:

Proc IEEE Int Symp Biomed Imaging. 2010 April 14; 2010: 169–172.

doi: 10.1109/ISBI.2010.5490386

PMCID: PMC2924593

NIHMSID: NIHMS180400

In single-particle reconstruction methods, projections of macromolecules at random orientations are collected. Often, several classes of conformations or binding states coexist in a biological sample, which requires classification, so that each conformation can be reconstructed separately. In this work, we examine bootstrap techniques for classifying the projection data. When these techniques are applied to variance estimation, the projection images (particles) are randomly sampled with replacement from the data set and a bootstrap volume is reconstructed from each sample. In a recent extension of the bootstrap technique to classification, each particle is assigned to a volume in the space spanned by the bootstrap volumes, such that the projection of the assigned volume best matches the particle. In this work we explain the rationale of these techniques by discussing the nature of the bootstrap volumes and provide some statistical analyses.

In single-particle reconstruction methods [1], projections of macromolecules at random, unknown orientations are collected by a transmission electron microscope. Often, several classes of conformations or binding states coexist in a sample. To obtain structures with high accuracy, the classes must be separated before the macromolecule is reconstructed. In this work, we take a close look at bootstrap techniques for classifying the projection data. In the bootstrap techniques for variance estimation [2], the projection images (or particles) are randomly sampled with replacement from the data set and a bootstrap volume is reconstructed from each sample, assuming the orientations to be known. In a recent extension of the bootstrap technique to classification [3], each particle is assigned to a volume in the space spanned by the bootstrap volumes, such that the projection (in the same orientation as the particle) of the assigned volume best matches the particle. Then, a clustering algorithm applied to the assigned volumes determines the class to which the particle belongs. In this work we explain the rationale of these techniques by discussing the nature of the bootstrap volumes, i.e., how they relate to the underlying structural classes. Furthermore, several statistical analyses should become easy to apply in our framework. Finally, the way the particles are assigned to volumes in the space spanned by the bootstrap volumes is closely examined, and our proposed solution differs from that given in [3].

In Section 2 we discuss the nature of the bootstrap volumes and the effect of noise, as well as the classification method based on the analysis of the bootstrap volumes (‘bootstrap classification’). Section 3 shows the results obtained by bootstrap classification for simulated and experimental data and a comparison of the bootstrap method with a maximum likelihood classification approach [4]. Finally, discussion and conclusions are provided in Section 4.

The aim of the bootstrapping technique [6] is to estimate the sampling distribution of an estimator by sampling with replacement from a given sample. It is a general-purpose approach to statistical inference, which circumvents the problem posed by the unavailability of large sample-size data.
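As a generic illustration of the idea (not specific to cryo-EM), the following minimal sketch resamples a small data set with replacement and uses the spread of the recomputed estimator as an estimate of its sampling variability. The helper name `bootstrap_distribution`, the toy Gaussian data, and the number of resamples are assumptions made for this example only.

```python
import random
import statistics

def bootstrap_distribution(sample, estimator, n_resamples=2000, seed=0):
    """Return estimator values computed on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical toy data: 50 draws from a unit-variance Gaussian.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

boot_means = bootstrap_distribution(data, statistics.fmean)
boot_se = statistics.stdev(boot_means)                  # bootstrap standard error of the mean
theory_se = statistics.stdev(data) / len(data) ** 0.5   # textbook estimate s / sqrt(n)
print(f"bootstrap SE = {boot_se:.3f}, s/sqrt(n) = {theory_se:.3f}")
```

For the sample mean the two estimates should roughly agree; the appeal of the bootstrap is that the same recipe applies when no closed-form expression is available, as is the case for a reconstructed 3D volume.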

If we repeatedly sample, with replacement, the projection data and reconstruct a 3D volume from each sample (assuming the corresponding orientations are known), we obtain an estimate of the probability distribution that reflects the “variability” in the data. This variability, which is estimated as the variance of the bootstrap volumes, is not only due to the presence of different conformational or binding states, which is the goal in 3D variance estimation [2], but also comes from imperfections in the data collection such as instrument shot noise, “background” noise, reconstruction artifacts, contrast transfer function effects [1], alignment error, etc. The two major sources of variance are the coexistence of different conformational or binding states and instrument noise. Unlike the latter, which is characteristic of the 2D projection data, the former is three-dimensional in nature. Therefore, care must be taken when relating the two. Here we attempt to establish such a relation by describing how the bootstrap volumes relate to the underlying true structures of the classes.

We show that the bootstrap volumes are in fact approximations to convex combinations of the true structures. For simplicity, we consider the case where the projection data come from a molecule occurring in *M* = 2 conformations; the analysis for more than two conformations follows straightforwardly. Assuming a discrete model, given the data $y\in {\mathbb{R}}^{I}$, the least-squares estimator is a popular criterion^{1} for finding the true volume $x\in {\mathbb{R}}^{J}$:

$${x}_{\mathit{LSQ}}=\mathrm{arg}\underset{x}{\mathrm{min}}\parallel y-Rx\parallel ,$$

(1)

where *R* is the discrete Radon transform and $\parallel \cdot \parallel$ is the Euclidean norm. Assuming that *R’R* is invertible (*R’* denotes the transpose of *R*), the solution to (1) has the closed form

$${x}_{\mathit{LSQ}}={\left({R}^{\prime}R\right)}^{-1}{R}^{\prime}y.$$

(2)
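A small numerical sketch of (1) and (2) follows. The random matrix `R` is only a stand-in for a discrete Radon transform, and the sizes and noise level are arbitrary assumptions; the point is that solving the normal equations reproduces the output of a general least-squares solver when *R’R* is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.standard_normal((40, 5))   # stand-in projection matrix (not a real Radon transform)
x_true = rng.standard_normal(5)    # hypothetical "true volume" with J = 5 voxels
y = R @ x_true + 0.01 * rng.standard_normal(40)   # slightly noisy data

# Closed form (2): solve the normal equations R'R x = R'y
# (a linear solve is used instead of forming the inverse explicitly).
x_normal = np.linalg.solve(R.T @ R, R.T @ y)

# Reference solution of (1) from NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(R, y, rcond=None)

print(np.allclose(x_normal, x_lstsq))
```

In practice the matrices are far too large for a direct solve, which is why iterative schemes such as the one mentioned in the footnote are used instead.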

In bootstrapping, data come from the two structures. Let ${y}_{1}\in {\mathbb{R}}^{{H}_{1}I}$ and ${y}_{2}\in {\mathbb{R}}^{{H}_{2}I}$ be the respective sampled projection data, where $H_i$ is the number of projection images taken from class *i* and $R_i$ is the corresponding discrete Radon transform, so that, in the absence of noise,

$${y}_{i}={R}_{i}{x}_{i},$$

(3)

for *i* = 1, 2. Without loss of generality, we can set the sampled data to be $y=\begin{bmatrix}{y}_{1}\\ {y}_{2}\end{bmatrix}$ and $R=\begin{bmatrix}{R}_{1}\\ {R}_{2}\end{bmatrix}$. Substituting the values of *y* and *R* in (2), we obtain an expression for the reconstructed bootstrap volume, based on the least-squares criterion,

$${x}_{\mathit{BS}}={({R}_{1}^{\prime}{R}_{1}+{R}_{2}^{\prime}{R}_{2})}^{-1}({R}_{1}^{\prime}{y}_{1}+{R}_{2}^{\prime}{y}_{2}).$$

(4)

Taking into account (3) and the fact that ${R}_{1}^{\prime}{R}_{1}+{R}_{2}^{\prime}{R}_{2}={R}^{\prime}R$, the bootstrap volume can be viewed as a sum of linear transformations of the true volumes *x*_{1} and *x*_{2}:

$${x}_{\mathit{BS}}={\left({R}^{\prime}R\right)}^{-1}({R}_{1}^{\prime}{R}_{1}{x}_{1}+{R}_{2}^{\prime}{R}_{2}{x}_{2}),$$

(5)

whose linear transformations ${\left({R}^{\prime}R\right)}^{-1}{R}_{i}^{\prime}{R}_{i},i=1,2$, sum up to the identity matrix of appropriate size.
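Identities (4) and (5) can be checked numerically. In the sketch below the matrices `R1` and `R2` are random stand-ins for the per-class projection operators, and all sizes are arbitrary assumptions; only the algebra is being verified, not any property of real Radon transforms.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 6                      # voxels per volume (toy size)
H1, H2, I = 30, 20, 4      # images per class and pixels per image (toy sizes)
R1 = rng.standard_normal((H1 * I, J))
R2 = rng.standard_normal((H2 * I, J))
x1, x2 = rng.standard_normal(J), rng.standard_normal(J)
y1, y2 = R1 @ x1, R2 @ x2  # noise-free data, as in (3)

RtR = R1.T @ R1 + R2.T @ R2                          # equals R'R for the stacked R
x_bs = np.linalg.solve(RtR, R1.T @ y1 + R2.T @ y2)   # bootstrap volume, eq. (4)

# The two linear transformations appearing in (5).
A1 = np.linalg.solve(RtR, R1.T @ R1)
A2 = np.linalg.solve(RtR, R2.T @ R2)

print(np.allclose(A1 + A2, np.eye(J)))           # they sum to the identity
print(np.allclose(x_bs, A1 @ x1 + A2 @ x2))      # eq. (5)
```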

We have shown that a bootstrap volume is a sum of linear transformations of the true class volumes. In fact, this sum is an approximation to a convex combination of the classes. To see this, we note that the effect of ${R}_{i}^{\prime}{R}_{i},i=1,2$, is essentially a blurring with a kernel that goes like 1/*r* (*r* is the radial distance; [8]) multiplied by a factor that is proportional to the number of projection images $H_i$ taken from class *i*. With $H={H}_{1}+{H}_{2}$ denoting the total number of projection images, it follows that

$${x}_{\mathit{BS}}\simeq \frac{{H}_{1}}{H}{x}_{1}+\frac{{H}_{2}}{H}{x}_{2}.$$

(6)
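The step from (5) to (6) can be spelled out as follows. This is a sketch under the idealization that the orientations in both classes are distributed alike, so that the same blurring operator, written here with the hypothetical symbol $K$, describes the contribution of a single projection in either class:

$${R}_{i}^{\prime}{R}_{i}\approx {H}_{i}K,\qquad {R}^{\prime}R={R}_{1}^{\prime}{R}_{1}+{R}_{2}^{\prime}{R}_{2}\approx ({H}_{1}+{H}_{2})K=HK,$$

$${\left({R}^{\prime}R\right)}^{-1}{R}_{i}^{\prime}{R}_{i}\approx \frac{1}{H}{K}^{-1}\,{H}_{i}K=\frac{{H}_{i}}{H}\,\mathrm{Id},\qquad i=1,2,$$

where $\mathrm{Id}$ denotes the identity matrix. Substituting the last expression into (5) gives (6). The weights ${H}_{1}/H$ and ${H}_{2}/H$ are nonnegative and sum to one, which is what makes the combination convex.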

It is easy to see that the right hand side of (6) corresponds to summing volumes from Bernoulli trials with support {*x*_{1}, *x*_{2}}, probability *p* (whose realization is *H*_{1}/*H* in this case), and dividing the sum by the total number of projections *H*:

$${x}_{\mathit{BS}}\simeq \frac{1}{H}\sum _{h=1}^{H}{x}^{h},\quad \text{where}\quad {x}^{h}=\begin{cases}{x}_{1} & \text{w.p. } p\\ {x}_{2} & \text{w.p. } 1-p\end{cases}.$$

(7)

That is, most of the bootstrap volumes are located near the center of the convex hull with vertices *x*_{1} and *x*_{2}. A concentration near the vertices would be more desirable from the point of view of estimating the convex hull.
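The concentration can be illustrated with a short simulation of the mixing weight $H_1/H$ in (6). The values of `p`, `H`, and the number of bootstrap samples below are arbitrary assumptions; the sketch only shows that the weights spread around *p* with a standard deviation of about $\sqrt{p(1-p)/H}$, so they stay far from the vertices (weights 0 and 1) when *H* is large.

```python
import random
import statistics

def mixing_weights(p, H, n_bootstrap, seed=0):
    """Fraction H1/H of class-1 particles in each of n_bootstrap samples of size H."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(H)) / H for _ in range(n_bootstrap)]

p, H = 0.5, 1000                       # hypothetical class proportion and sample size
w = mixing_weights(p, H, n_bootstrap=500)

spread = statistics.stdev(w)           # observed spread of H1/H
predicted = (p * (1 - p) / H) ** 0.5   # binomial standard deviation of H1/H
print(f"min={min(w):.3f} max={max(w):.3f} spread={spread:.4f} predicted={predicted:.4f}")
```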

Imperfections in the data come from the electron optics (astigmatism, spherical aberration, etc.), background noise, shot noise, alignment error, etc. Let us assume an additive noise model for the 2D projection data, such that it can be “back-projected,” leading to an additive noise model for the 3D volumes. That is, in (3) each $y_i$, *i* = 1, 2, is modeled as the projection of the class volume $x_i$ perturbed by an additive 3D noise term.

An immediate consequence of the fact that the bootstrap volumes are convex combinations of the class volumes is that the space spanned by the bootstrap volumes approximates that spanned by the class volumes. Hence, for each projection image we can restrict ourselves to that space when estimating the volume that generated the projection. Ideally, these estimated volumes cluster around the true class volumes.
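A toy principal-component computation makes this concrete for two classes. The sketch below builds synthetic "bootstrap volumes" as noisy convex combinations of two random vectors and extracts the average volume and the leading eigen-volume with an SVD. The uniform spread of the weights, the noise level, and all sizes are assumptions chosen for illustration; they do not reproduce the experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
J, B = 50, 200                         # voxels per volume, number of bootstrap volumes
x1, x2 = rng.standard_normal(J), rng.standard_normal(J)   # hypothetical class volumes
w = rng.uniform(0.3, 0.7, size=B)      # mixing weights H1/H, one per bootstrap volume
volumes = np.outer(w, x1) + np.outer(1 - w, x2) + 0.01 * rng.standard_normal((B, J))

z0 = volumes.mean(axis=0)              # average volume
_, s, Vt = np.linalg.svd(volumes - z0, full_matrices=False)
z1 = Vt[0]                             # leading eigen-volume (unit norm)

# For two classes, the variation lies along the difference x1 - x2.
d = (x1 - x2) / np.linalg.norm(x1 - x2)
alignment = abs(z1 @ d)                # close to 1 when z1 is parallel to x1 - x2
print(f"alignment = {alignment:.4f}, s[0]/s[1] = {s[0] / s[1]:.1f}")
```

With *M* classes one would expect roughly *M* − 1 dominant directions, the remaining ones being driven by noise.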

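We now proceed to consider the case of *M* ≥ 2 classes. Suppose we have generated a sufficient number of bootstrap volumes and *H* is large enough, so that their principal directions are close to the principal directions of the space of the class volumes. Let ${z}_{1},\dots ,{z}_{N}\in {\mathbb{R}}^{J}$ be the resulting eigen-volumes and *z*_{0} the average volume. Given a projection image $y_P$, we wish to find an element ${z}_{0}+\sum _{n=1}^{N}{\alpha}_{n}{z}_{n}$ of this space whose projection, computed by the projection operator *P* in the same orientation as the particle, best matches $y_P$; that is, we seek the coefficients ${\alpha}_{1},\dots ,{\alpha}_{N}$ that minimize

$$\left\Vert \sum _{n=1}^{N}{\alpha}_{n}P{z}_{n}+P{z}_{0}-{y}_{P}\right\Vert .$$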

(8)

To avoid shift and scale variabilities, *y _{P}* and

Algorithm 1 summarizes our proposed approach to classification using bootstrapping, which is the same as the existing algorithm [3], except for the way in which the coefficients are determined. In [3], apparently the $\alpha_n$ are set to be the inner product between
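One way to minimize (8) is an ordinary linear least-squares fit, sketched below. The function name `fit_coefficients`, the random stand-in for the projection operator *P*, and the toy sizes are assumptions made for illustration; this is not Algorithm 1 itself. For comparison, the last line computes plain inner products with the projected eigen-volumes, which coincide with the least-squares coefficients only when the projected eigen-volumes are orthonormal.

```python
import numpy as np

def fit_coefficients(P, z0, Z, y):
    """Least-squares alpha minimizing || P @ (z0 + Z @ alpha) - y ||, as in (8).

    P  : (I, J) projection matrix for the particle's orientation (assumed known)
    z0 : (J,)   average volume
    Z  : (J, N) eigen-volumes stored as columns
    y  : (I,)   particle image, flattened
    """
    A = P @ Z                                   # projected eigen-volumes, shape (I, N)
    alpha, *_ = np.linalg.lstsq(A, y - P @ z0, rcond=None)
    return alpha

rng = np.random.default_rng(0)
I, J, N = 25, 40, 3
P = rng.standard_normal((I, J))                 # stand-in for a real projector
z0 = rng.standard_normal(J)
Z = rng.standard_normal((J, N))
alpha_true = np.array([1.5, -0.7, 0.2])
y = P @ (z0 + Z @ alpha_true)                   # noise-free synthetic particle

alpha_lsq = fit_coefficients(P, z0, Z, y)       # recovers alpha_true here
alpha_ip = (P @ Z).T @ (y - P @ z0)             # plain inner products, for comparison
print(alpha_lsq, alpha_ip)
```

The projected eigen-volumes are generally not orthogonal even when the eigen-volumes are, because projection does not preserve inner products; this is why the two coefficient vectors differ in the example.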

We tested our proposed algorithm on experimental and simulated data. The experimental data set consists of ten thousand 130 × 130 particles randomly chosen from a larger data set, on which the maximum likelihood (ML) classification method (a popular alternative to ours) [5] was previously tested, giving rise to two main structures: the 70S *E. coli* ribosome in the classical and hybrid state (see Fig. 1). For the simulated data set, we used these two states as phantoms and generated ten thousand 130 × 130 noisy projections in exactly the same manner as described in [10]; i.e., the SNR was 0.06 and the CTF was applied. To gain computational speed, we decimated the particles to size 65 × 65 in both data sets, aligned them to a library of reference projections (on a ten-degree angular grid) of a common 3D reference (the density map of the ribosome in one of the two states), and used SPIDER [11] to generate forty thousand bootstrap volumes in each case. It was necessary to filter the volumes, and for that we used a low-pass filter with cut-off 0.1, which was limited to the first lobe of the CTF [9] (though this value was not optimized). We relied on SPARX [12] to perform the eigen-decompositions. The clustering of the coefficient vectors *α* was performed via the *k*-means algorithm. To assess the classification performance, we also tested the ML method on the simulated data set.
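The clustering step can be sketched with a plain Lloyd's-algorithm *k*-means. The synthetic coefficient vectors below are hypothetical stand-ins for the *α* of real particles, and the function `kmeans` is an illustrative implementation; the source does not state which *k*-means implementation was used.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every point to every center, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical coefficient vectors (N = 2) for two well-separated classes.
rng = np.random.default_rng(1)
alphas = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
                    rng.normal(+2.0, 0.5, size=(100, 2))])
labels, centers = kmeans(alphas, k=2)
print(np.bincount(labels))
```

Because the result of *k*-means depends on the initial centers, it is common to repeat the clustering from several random starts when the classes are less clearly separated than in this toy example.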

Fig. 2 shows the result of classification using our version of the bootstrapping classification method. We used five eigen-volumes (*N* = 5) and looked for two classes (*M* = 2). One can immediately recognize the differences between the two structures in the presence/absence of EF-G and the position of the L1 stalk. Not visible is the presence/absence of the A-site tRNA, which is another difference that the algorithm was able to pull out.

As measured by a classification error score whose minimum is 0% (perfect classification) and the maximum is 50% (random guess), the bootstrapping method (with *N* = *M* = 2) yields 16%±0.2% (see Fig. 1) versus 34%±10% for the ML approach [4] (refinement angle of ten degrees, two classes, 20 iterations). The confidence interval of the classification error was obtained by running the respective algorithms ten times. The large dispersion of the figure in the ML method is likely due to the presence of local maxima (which is not an issue in the bootstrapping approach, except for the *k*-means algorithm) and the relatively small data size.
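The exact definition of the error score is not given here. One plausible definition with the stated range, shown below as an assumption, is the fraction of misassigned particles minimized over the two possible namings of the classes, since a clustering algorithm returns classes in arbitrary order. The function name `two_class_error` is hypothetical.

```python
def two_class_error(true_labels, predicted_labels):
    """Label-swap-invariant error rate for two classes, in percent (0 to 50)."""
    n = len(true_labels)
    mismatches = sum(t != p for t, p in zip(true_labels, predicted_labels))
    rate = mismatches / n
    # Swapping the two class names turns an error rate r into 1 - r.
    return 100.0 * min(rate, 1.0 - rate)

print(two_class_error([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, names swapped
print(two_class_error([0, 0, 1, 1], [0, 1, 0, 1]))  # no better than chance
```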

We have explained the rationale of bootstrapping in the context of classification and proposed an algorithm that differs from the one initially proposed in an important detail. Through repeated reconstructions from bootstrap samples, we can estimate the space spanned by the underlying class structures. Searching this space for a volume whose projection best matches a given particle helps us decide on the class to which the particle belongs. We show that the bootstrapping approach offers a competitive alternative to current popular methods, such as the ML approach: the former does not suffer from the effect of local maxima (except for the clustering algorithm, if *k*-means is used). Note that in the experiments the angular assignment of the projection data was done only once, at the beginning, and with respect to one reference volume. An iterative process, in which the angles are refined with respect to the current reconstructed class volumes, should provide even better results. Further improvement may also come from alternative ways of finding the coefficient vector *α*, since the Euclidean distance in (8) is sensitive to outliers. Finally, classification becomes more challenging as the variability of the structure classes competes with noise in the data, both of which are scaled down by the number of particles used in the sampling. Hence, to reduce the noise, it is necessary to find ways other than increasing the number of particles; if filtering is used for that purpose, the loss of high-frequency information can be detrimental to the classification.

We are grateful to Zhi-Quan (Tom) Luo for help with optimization.

^{1}For instance, the well-known SIRT algorithm can be viewed as a gradient descent algorithm for finding $x_{\mathit{LSQ}}$ [7].

[1] Frank J. Three-Dimensional Electron Microscopy of Macromolecular Assemblies. Oxford University Press; New York: 2006.

[2] Penczek PA, Yang C, Frank J, Spahn CMT. Estimation of variance in single-particle reconstruction using the bootstrap technique. J. Struct. Biol. 2006;154:168–183.

[3] Spahn CMT, Penczek PA. Exploring conformational modes of macromolecular assemblies by multi-particle cryo-EM. Current Opinion in Structural Biology. 2009;19:623–631.

[4] Scheres SHW, Valle M, Grob P, Nogales E, Carazo JM. Maximum likelihood refinement of electron microscopy data with normalization errors. J. Struct. Biol. 2009;166:234–240.

[5] Scheres SHW, Gao H, Valle M, Herman GT, Eggermont PPB, Frank J, Carazo JM. Disentangling conformational states of macromolecules in 3D-EM through likelihood optimization. Nat. Methods. 2007;4:27–29.

[6] Efron B. Bootstrap methods: Another look at the jackknife. The Annals of Statistics. 1979;7:1–26.

[7] Herman GT. Image Reconstruction from Projections: The Fundamentals of Computerized Tomography. Academic Press; New York: 1980.

[8] Deans SR. The Radon transform and some of its applications. John Wiley & Sons; New York: 2006.

[9] Zhang W, Kimmel M, Spahn CM, Penczek PA. Heterogeneity of large macromolecular complexes revealed by 3D cryo-EM variance analysis. Structure. 2008;16:1770–1776.

[10] Baxter WT, Grassucci RA, Gao H, Frank J. Determination of signal-to-noise ratios and spectral SNRs in cryo-EM low-dose imaging of molecules. J. Struct. Biol. 2009;166(2):126–132.

[11] Shaikh TR, Gao H, Baxter W, Asturias FJ, Boisset N, Leith A, Frank J. SPIDER image processing for single-particle reconstruction of biological macromolecules from electron micrographs. Nat. Protoc. 2008;3:1941–1974.

[12] Hohn M, Tang G, Goodyear G, Baldwin PR, Huang Z, Penczek PA, Yang Ch., Glaeser RM, Adams P, Ludtke SJ. SPARX, a new environment for cryo-EM image processing. J. Struct. Biol. 2007;157:47–55.
