|Home | About | Journals | Submit | Contact Us | Français|
Sedimentation velocity analytical ultracentrifugation has become a very popular technique to study size distributions and interactions of macromolecules. Recently, a method termed two-dimensional spectrum analysis (2DSA) for the determination of size-and-shape distributions was described by Demeler and colleagues (Eur Biophys J 2009). It is based on novel ideas conceived for fitting the integral equations of the size-and-shape distribution to experimental data, illustrated with an example but provided without proof of the principle of the algorithm. In the present work, we examine the 2DSA algorithm by comparison with the mathematical reference frame and simple well-known numerical concepts for solving Fredholm integral equations, and test the key assumptions underlying the 2DSA method in an example application. While the 2DSA appears computationally excessively wasteful, key elements also appear to be in conflict with mathematical results. This raises doubts about the correctness of the results from 2DSA analysis.
The use of sedimentation velocity analytical ultracentrifugation (SV) has significantly expanded in the last decade (Howlett et al. 2006; Scott and Schuck 2006; Cole et al. 2008), and new computational methods for SV analysis are being actively developed by several groups (Balbo et al. 2005; Philo 2006; Brown et al. 2007, 2009; Behlke and Ristau 2009; Brookes et al. 2009; Correia and Stafford 2009). In particular, diffusion-deconvoluted sedimentation coefficient distributions calculated from direct boundary modeling of experimental data (Schuck 2000; Schuck et al. 2002) have proven to be very useful tools in many biophysical applications (for a list of references see Schuck 2007). They can achieve relatively high hydrodynamic resolution of pauci- and polydisperse macromolecular mixtures, exhibit exquisite sensitivity for trace components (Berkowitz 2006; Liu et al. 2006; Brown et al. 2008a, b; Gabrielson et al. 2009), and can be related to sedimentation coefficient isotherms and Gilbert–Jenkins theory for the analysis of slowly or rapidly interacting systems (Dam and Schuck 2005; Dam et al. 2005). The extension of sedimentation coefficient distributions to two-dimensional size-and-shape distributions was introduced (Schuck 2002; Brown and Schuck 2006) and applied in numerous studies (Markossian et al. 2006; Chang et al. 2007; Deng et al. 2007; Race et al. 2007; Broomell et al. 2008; Brown et al. 2008; Chebotareva et al. 2008; Iseli et al. 2008; Moncrieffe et al. 2008; Paz et al. 2008; Sivakolundu et al. 2008; Wang et al. 2008; Eronina et al. 2009; Mortuza et al. 2009). More recently, the Demeler laboratory has described the concept of a novel algorithm (“2DSA”) for determining size-and-shape distributions, as implemented in the software ULTRASCAN (Brookes et al. 2006, 2009; Demeler et al. 2009). In the present work, we critically compare the background of the different algorithms and assess their performance.
The SV experiment was carried out with a Beckman-Coulter XL-I analytical ultracentrifuge, following standard protocols as described by Brown et al. (2008a, b). A monoclonal immunoglobulin G (IgG) preparation in phosphate-buffered saline (PBS) buffer was inserted in 12-mm Epon centerpieces, temperature equilibrated at 18°C, and then accelerated to 45,000 rpm and scanned with absorbance optics at 280 nm. Data analysis was performed with SEDFIT 11.8 using c(s) models as described by Schuck et al. (2002), the two-dimensional size-and-shape model c(s, fr) as described by Brown and Schuck (2006), and applying Bayesian prior knowledge as described in detail by Brown et al. (2007). The computer used for these analyses was a Dell Precision T5400 workstation, with dual 32-bit quadcore 3.16-MHz processors and Windows operating system.1
For clarity of the analysis of the algorithms, we first provide a mathematical outline of the problem. This is followed by a more detailed discussion of appropriate discretization parameters, and from this we derive the demands on the computational platforms. Then we discuss algorithmic aspects for calculating Lamm equation solutions and for computing a size-and-shape distribution from the experimental data, and finally comment on methods for estimating their true information content.
The size-and-shape distribution problem is a Fredholm integral equation of the form
where the data a(r, t) are the measured evolution of the radial signal profiles, and c(s, fr) is a differential size-and-shape distribution, expressed most conveniently for the modeling of SV data in coordinates of sedimentation coefficient s and frictional ratio fr (Brown and Schuck 2006). χ(s, fr, r, t) are normalized solutions of the Lamm equation (Lamm 1929)
which predicts the evolution of the concentration profiles of an ideally sedimenting species with sedimentation coefficient s and diffusion coefficient D(s, fr) that is initially uniformly distributed between the meniscus and bottom of the solution column at loading concentration of 1.
Equation (1) can be discretized on a rectangular grid with (S × F) size-and-shape values (si, fr,j) comprising all combinations of S equidistant sedimentation coefficient values from s1 = smin to sS = smax (with constant mesh size ), and F frictional ratio values from fr,1 = fr,min to fr,F = fr,max (with constant mesh size ). With the data being (N × M) discrete signal values at radius rn and time tm, abbreviated as anm, (1) leads to the linear least-squares problem
The ci,j provide an estimate of the size-and-shape distribution with . Signal offsets from systematic time-invariant [b(rn)] and radial-invariant [β(tm)] noise contributions are indicated in Eq. (3), but their simultaneous optimization with the method of separation of linear and nonlinear parameters (Ruhe and Wedin 1980) poses no significant further complications (Schuck and Demeler 1999) and therefore they will be dropped from further consideration in order to make the notation more transparent in the following.2
We can introduce a new index l that lexicographically orders all data points (a total of L = N × M), and a single index k that enumerates all size-and-shape grid points (si, fr,j) from 1 to K = S × F, which allows us to write (3) as a simple sum
This highlights the nature of the problem being a standard nonnegative linear least-squares problem. The unconstrained problem can be solved with the method of normal equations (Lawson and Hanson 1974; Golub and VanLoan 1989)
with the K × K matrix P (sometimes referred to as the Gram matrix) with elements , the K × 1 vector with elements , and the K × 1 vector representing the unknown distribution.
The unique best-fit solution with the nonnegativity constraint ck ≥ 0 can be found unambiguously with the algebraic algorithm NNLS, which was introduced and proven by Lawson and Hanson (1974). We first used the NNLS algorithm in the context of SV distribution analysis, in a form where we expressed all requisite quantities with elements of the normal equations (Schuck 2000). NNLS is an active set algorithm that divides the unknowns into sets with active (ck = 0) and inactive (ck > 0) inequalities, and iteratively establishes the active set producing the best-fit solution. For the inactive set, the problem takes the same form as (5), but with all matrix and vector elements from components with active constraints deleted (Gill et al. 1986).
Frequently the problem of fitting distributions of the form (1) is ill posed, meaning that many different solutions will fit the data statistically indistinguishably well (Louis 1989; Hansen 1998; Engl et al. 2000). For example, Provencher (1982) has illustrated this point via the Lemma of Riemann–Lebesgue, showing that one should expect a large set of very different solutions to fit the data equally well within the experimental error. In practice, noise of the data can amplify to determine even the overall features of the best-fit solution and often the strictly best-fit solution consists of a series of spikes whose number, location, and height may not reflect the presence of such species in the physical experiment, but are governed by the details of the noise and other imperfections in the data.
It is therefore desirable to suppress, among all possible solutions, those that contain a potentially misleading amount of detail arising from noise amplification. Towards this goal, regularization is a standard approach that determines the most parsimonious solution of all that fit the data statistically indistinguishably well. It minimizes a measure of the information content of the solution while optimizing the quality of fit. A well-known and widely applied strategy to suppress artificial spikes is Tikhonov–Phillips regularization (Phillips 1962; Provencher 1982; Louis 1989; Hansen 1992; Press et al. 1992), which uses, for example, the square of the second-derivative matrix (Hkκ) to stabilize the solution of (4):
or, formulated with normal equations,
where α is a parameter that scales the regularization constraint (Louis 1989; Press et al. 1992). Again, (7) has an unambiguous best-fit solution that can be determined algebraically with NNLS for any value of α, and the latter can be adjusted in a simple one-dimensional search such that the least-squares fit remains at a statistically indistinguishable quality compared with the initial best fit in the absence of regularization (Bevington and Robinson 1992). A Bayesian variation of this approach is possible that modulates the regularization matrix to enhance the information content of the solution in view of existing (or hypothesized) prior knowledge (Sivia 1996; Brown et al. 2007; Patel et al. 2008).
We will refer to this approach as the “standard algorithm,” because it is firmly rooted in textbook linear algebra and basic linear least-squares optimization, and utilized in many applications throughout the biophysical literature and physical sciences. We have introduced this approach previously into the SV analysis, and it underlies all size-distribution analyses in SEDFIT and SEDPHAT. If used without regularization, it provides exact solutions (within numerical precision) to the least-squares problem (3), and when used with regularization, the algorithms ensure that fits with statistically indistinguishable quality are obtained.
The 2DSA method by Demeler and colleagues aims to solve the same Eqs. (1), (3), and (4), respectively. This is described by Brookes et al. (2009), and with less mathematical detail by Demeler et al. (2009). The Demeler approach deviates in key aspects from the strategies described above. Apparently in order to circumvent perceived computational limitations, a novel multigrid scheme is conceived that would allow a sequence of fits with low-resolution 10 × 10 (S × F) grids to approximate the solution of (1) and (3) with high-resolution S 10 and F 10. For achieving parsimonious results Monte Carlo iterations are applied (Brookes et al. 2009). Some of the key ideas will be discussed in the following.
First, in order to assess the size of the problem and computational requirements, we need to clarify how fine the grid of s-values and fr-values needs to be in order to fully extract all information from a typical set of sedimentation velocity data. Let us consider as an example the experimental data from a preparation of IgG molecules sedimenting at 45,000 rpm, as shown in Fig. 1a. It is useful to start the analysis with a one-dimensional sedimentation coefficient distribution analysis c(s), since the sedimentation coefficients are the experimentally best determined quantities. c(s) eliminates the shape dimension by using hydrodynamic scaling laws such as the traditional s ~ M2/3 law for globular particles (Schuck 2000), theoretical models for wormlike chains (Yamakawa and Fujii 1973) or any user-defined exponential scaling laws for polymers (Pavlov et al. 2009). For the given data we can determine from the c(s) analysis (not shown) that s-values from 0.1 to 15 S will be sufficient to describe all sedimenting species. Equidistant discretizations with S = 100 or S = 200 lead to statistically indistinguishable quality of fit, as measured by F-statistics (Bevington and Robinson 1992; Straume and Johnson 1992), and therefore we preliminarily conclude that S = 100 will be a reasonable choice.
Typically, the resolution in the frictional ratio dimension cannot be expected to be very high, even in combination with data from SV experiments at a range of rotor speeds (Schuck 2002). Therefore, a discretization providing F = 10 values between 1.0 and 2.5 (ranging from extremely compact to very extended protein structures) should be a sufficiently flexible basis to describe the actual frictional ratio for each species (knowing that we have inserted folded proteins into the sample solution, and keeping in mind the average frictional ratio of 1.68 estimated from the c(s) analysis). The resulting 10 × 100 grid with a total of K = 1,000 species was fitted with the standard algorithm to the data in Fig. 1a, leading to virtually random distribution of residuals ((1b),1b), with a root-mean-square deviation (rmsd) of 0.00672, consistent with the noise in the data acquisition. The resulting distribution is shown with and without regularization in Fig. 1d and c, respectively. As expected, while the s-values of the species are well defined, the shape dimension is highly underdetermined, resulting in the typical series of peaks in in1c,1c, and in a smooth relatively uniform distribution after regularization in in1d.1d. (This justifies, in retrospect, the choice of F = 10 values as a sufficiently detailed discretization of the frictional ratio dimension.)
We can compare the rmsd achieved with this 10 × 100 grid (0.00672) with a fit under otherwise identical conditions but different grids: a coarser 10 × 50 grid leads to an rmsd of 0.00678, which is barely statistically worse (on a one standard deviation confidence level), and a finer grid with 20 × 200 grid leads to an rmsd of 0.00670, which is statistically indistinguishable. Thus, a 10 × 100 grid is of sufficiently high resolution to extract the entire information content of the experimental data.
After outlining the structure of the problem and the discretization parameters typically required for a size-and-shape analysis of SV data, it is possible to discuss the computational requirements. Demeler’s 2DSA method was implemented with the goal of accessing a high-performance computing environment (TeraGrid) in order to avoid prohibitive memory limitations that Demeler and colleagues perceive to occur when using ubiquitous ordinary laboratory workstations. Specifically, the authors (Brookes et al. 2009) estimate the memory needs for modeling a set of M = 50–100 sedimentation velocity scans with typically N = 500–800 data points each by only a low-resolution 10 × 10 (S × F) grid. They conclude that “Performing just a 10 × 10 grid search on such an array would require close to half a gigabyte of memory just for data storage of a single experiment.” (Brookes et al. 2009). We will examine this estimate in more detail.
In practice, when using the absorbance optics with the recommended and widely applied setting of 0.003 cm (Brown et al. 2008a, b) for the radial intervals, in order to diminish errors from sample migration during the scan (Brown et al. 2009), we obtain only on the order of ~200–250 points per scan in a long-column SV experiment. In typical high-speed SV experiments with eight-hole rotors, we can acquire usually only 50 scans or fewer before depletion occurs and/or migration and backdiffusion approach steady state, even with small solutes. This is sufficient for a highly detailed analysis of multicomponent systems, as discussed by Balbo et al. (2005). Predicted values χ(si, fr, j, rn, tm) need to be calculated for each species (si, fr, j) with arrays of the same size as the data a(rn, tm). Since the experimental data have a precision not better than four decimal places, their representation as a standard 32-bit floating-point data type with eight significant figures is already wasteful. Nevertheless, calculating conservatively with 32-bit floats we arrive at a memory requirement of only ~4.8 MB for storage of model data, rather than 0.5 GB [250 × 50 × 10 × 10 × 4 bytes × (1,048,576 bytes/MB)−1 = 4.76 MB]. With interference optical (IF) data, the native radial density of points is higher (~1,500 per scan). Since the radial density of points of interference scans is not exploited experimentally, it could be safely reduced to the level of absorbance data by pre-averaging, which reduces the statistical noise approximately by a factor of 2. However, again calculating conservatively and using the native resolution of IF data, this would lead to ~28 MB storage space, or ~57 MB if 100 scans were used to represent the evolution in a SV experiment.
We find that the ~5–50 MB actually required for calculating size-and-shape distributions with 10 × 10 grids is compatible with the available memory on many different platforms, ranging from >200,000 MB available on TeraGrid systems, to ~2,000–3,000 MB typically available on 32-bit Windows, and even the ~50–90 MB available on current smartphones.
Consistent with this result, we and others (Markossian et al. 2006; Chang et al. 2007; Deng et al. 2007; Race et al. 2007; Broomell et al. 2008; Brown et al. 2008; Chebotareva et al. 2008; Iseli et al. 2008; Moncrieffe et al. 2008; Paz et al. 2008; Sivakolundu et al. 2008; Wang et al. 2008; Eronina et al. 2009; Mortuza et al. 2009) have regularly used full high-resolution grids (such as 10 × 50, 10 × 100, or higher) in SEDFIT on ordinary personal computers or laptops, an exercise that is a regular part of the Workshop on Hydrodynamic and Thermodynamic Analysis of Macromolecules with SEDFIT and SEDPHAT at the National Institutes of Health (Schuck 2009). This is possible due to the fact that the memory requirement for the high-resolution grid would be 48–286 MB to store the model data (assuming 50 scans for data absorbance or native interference data modeled with a 10 × 100 grid). It is readily verified that, even for the complete high-resolution grid and when globally analyzing many experimental data sets, this is well below the memory limit of currently common 32-bit Windows operating systems.
Further, all computations can be condensed to the normal Eq. (5), requiring essentially only a matrix P of 1,000 × 1,000 numbers to be operated on, which even as double-precision data type requires less than 8 MB, trivial by current standards on any platform. Once condensed to the form of Eq. (5), our SV problem is far smaller (often several orders of magnitude) than common problems of analogous mathematical structure, for example, in astronomical image analysis (Narayan and Nityananda 1986). For the data shown in Fig. 1, in the implementation in SEDFIT (which does not optimize memory allocation), ~20 MB of RAM are used.
The necessary computational power will strongly depend on the implementation of the algorithms, of course. Parallelization can be readily achieved in the standard algorithm, which can in many places decrease the computation time by a factor virtually proportional to the number of threads. This is true, for example, for solving the Lamm equations, and for the most time-consuming step of building the normal equations matrix. The time for a complete calculation with a full high-resolution grid (10 × 100) for the data shown in Fig. 1, on a current dual-processor quadcore 3.16-MHz PC (Dell Precision T5400), is only 42 s.3 The time required for a 10 × 50 grid, which we have already seen leads to an adequate fit within the noise of data acquisition, is 10 s. Finally, it is ~15 min for a 20 × 200 grid. In the standard algorithm a Monte Carlo statistical analysis may be desired, for example, in order to examine the statistical accuracy of a particular species population as determined from the integration of the distribution in a certain range. In the standard algorithm, each iteration requires only updating the vector of the normal Eq. (5) and solving these equations. For the data shown in Fig. 1, one iteration takes ~3 s on a single thread on our PC.
We conclude that ordinary current workstations do not pose a limitation for rigorously determining the size-and-shape distributions, neither with regard to available memory, nor with regard to processor capabilities.
Modeling a distribution of species with different size and shape to the data depends critically on the accuracy of the Lamm equation solutions (2) that predict the sedimentation profiles for all species. For calculating Lamm equation solutions, Demeler and colleagues apply the ASTFEM algorithm that was recently introduced by Cao and Demeler (2005). In that work, the authors report two criteria for the performance of their ASTFEM algorithm in comparison with the reference (true) solution: (1) the overall rmsd (referred to by Cao and Demeler as “L2 error,” in a nonstandard definition), and (2) the maximum error in the evolution of concentration profiles.
That the rmsd is small (compared with the noise of data acquisition) is a necessary but not sufficient condition for the algorithm to be useful in modeling experimental data. In fact, the majority of points of the predicted concentration profiles typically fall into the plateau regions, which are trivial to predict (those in the solvent plateau are constant zero) but have limited or no information about the sedimentation process. These plateau points can keep the overall rmsd error of the solution below the statistical errors of the data acquisition, even though the maximum errors in the sedimentation boundaries may be much larger.
The accuracy of the description of the shape of the sedimentation boundary (rather than the plateaus) is critical for modeling the size-and-shape distributions. Therefore, a sufficient condition is that the maximum error is smaller than the noise of the data acquisition. For example, in order to model experimental data with signal-to-noise ratio of up to ~1000:1, the maximum errors of the Lamm equation solutions at unit concentration should be less than 0.001.
For numerically solving the Lamm equation, an overriding question is the discretization of the radial coordinate. Solutions with fine radial mesh are generally more accurate but computationally more expensive, and conversely, coarsely discretized Lamm equation solutions are quicker to calculate. Even though it has not been explicitly mentioned in the SV literature until recently (Brown and Schuck 2008), it is easy to see that a fundamental limitation for any finite-element algorithm with linear elements is the obligate error that occurs when approximating a smooth, curved function with piecewise linear segments. This is illustrated in Fig. 2 for a system chosen by Cao and Demeler (2005) as a benchmark in the introduction of their ASTFEM algorithm. Figure 2 shows the deviations of the curved, accurate solution from a series of linear segments with a total of only 100 (red) or 200 (blue) radius values from meniscus to bottom.
For the determination of suitable radial mesh sizes for calculating the Lamm equation solution, Cao and Demeler applied the L2 error criterion. This led to the recommendation of very coarse grids with ~100 points, and indeed the main benefit of the ASTFEM algorithm perceived by Cao and Demeler (2005) is numerical stability even for such very coarse radial grids.
Unfortunately, large maximum errors in the approximation of the sedimentation boundaries are an unavoidable consequence of coarse radial discretization. In fact, the errors in the sedimentation boundaries shown in Fig. 2 are similar in magnitude to those of Figs. 8b and 9b in Cao and Demeler (2005). Remarkably, none of the examples provided by Cao and Demeler (2005) led to maximum errors below 0.001, and in most cases it was a factor of 10 or more above this mark. Such errors can be expected to significantly impact the result of the size-and-shape distribution analysis.
We have recently derived a new finite-element algorithm (Brown and Schuck 2008) based on the recognition that the approximation of the concentration profiles as linear segments does not only generate an obligate error (independent of the algorithm), but that this also represents the dominant source of error in the finite-element approach as described by Claverie et al. (1975). Accordingly, we generate a set of nonequidistant radial grid points with optimal spacing to achieve Lamm equation solutions with constant, predetermined accuracy (as measured by the maximum error for the radial data range to be analyzed). To optimize the efficiency, all points in the solvent and solution plateaus are calculated with the trivial analytical expressions (Brown and Schuck 2008). We note that, for the 10 × 100 grid shown in Fig. 1, the calculation of the Lamm equation solutions for all 1,000 species with an accuracy of better than 0.001 (maximum error) requires a total of less than 2 s on our PC. Thus, computational expense for achieving high-accuracy Lamm equations should not be limiting the size-and-shape distribution analysis of SV data.
The 2DSA algorithm applied by the Demeler laboratory consists of a large number of repeated applications of Eq. (4) with K ≈ 10 × 10 and similarly low-resolution grids. Figure 3 shows the results of fitting the same data as shown in Fig. 1 with a 10 × 10 grid under otherwise identical conditions. The deviations are ±10% of the maximum signal, and clearly this model does not even qualitatively describe the data well. As a consequence, we cannot assume that the distribution obtained from this model reflects in any way the species present in the experiment. (It is grossly different, for example, from the distribution shown in Fig. 1c, d.)
Brookes et al. (2009) recognize that such a fit is insufficient and consistently attribute the idea of using 10 × 10 grids to Brown and Schuck. For example, the authors state “…a 10 × 10 grid as proposed by Brown and Schuck (2006) is insufficient to reliably describe the experimental parameter space. If the actual solute is not aligned with a grid point, false positives are produced,” and even declare the second major point in their results as “A 10 × 10 grid suggested by Brown and Schuck (2006) is clearly insufficient…,” and state in the summary “We have shown that low resolution grids as proposed by Brown and Schuck (2006) are insufficient to obtain reliable information.” This attribution is not based on reality. Unmistakably, we have used in the referenced work (Brown and Schuck 2006) exclusively high-resolution grids (11 × 100 in Fig. 1 and 2, 12 × 60 in Fig. 3, 15 × 50 in Fig. 4, 13 × 100 in Fig. 5, 11 × 100 in Fig. 6, and finally 13 × 50 in Fig. 7), all of which are shown to describe the data well to within the noise of data acquisition (Brown and Schuck 2006), and similar is true for other published applications of the method by other laboratories and by us. Thus, the idea of using 10 × 10 grids is entirely a product of the Demeler laboratory, and, to our knowledge, first described in the Brookes et al. (2009) paper.
Despite the failure of overly coarse grids, remarkably, Demeler’s 2DSA algorithm consists exclusively of repeat applications of such coarse grids: They are considered “subgrids” of a hypothetical grid with much higher resolution, which is never actually completely fitted to the data, but nevertheless suggested to reflect the final resolution of the distribution. The details are not entirely clear, but there are two key ideas: (I) The coarse grids are translated relative to each other multiple times by increments Δ2s and Δ2fr, and their results are joined. (II) The joined set of grid points with inactive nonnegativity constraints from (I) is used to form a new, second-stage irregular grid of similarly low number of grid points as the initial grid.4 The Demeler scheme of repeat application of different coarse subgrids, storage, and combination of their results, is termed a “divide-and-conquer” strategy. Divide-and-conquer algorithms are well-known tools in numerical mathematics that facilitate the use of parallel computation to solve problems, such as singular value decomposition (Arbenz and Golub 1988; Gu and Eisenstat 1995; Xu and Qiao 2008). Generally, such algorithms are established by proof of their correctness. This criterion has not been attempted for the 2DSA algorithm. Concerns arise from the following arguments:
The premise underlying (I) is that the results from independent application of different grids can be meaningfully combined. Following the idea of the Demeler laboratory that low-resolution subgrids can be “refined into a grid of any desired resolution” through their combination scheme, let us consider that putative final regular high-resolution grid, which would have mesh size Δs = Δ2s and Δfr = Δ2fr. As shown above, one can actually solve the size-and-shape distribution problem directly using the standard algorithm with this full-sized high-resolution grid with even mesh size, via the normal equation (5) with the K × K matrix P and K × 1 vector , where K is the total number of species of the two-dimensional grid. In our example of Fig. 1, K = 1,000 for the 10 × 100 grid that is of sufficient resolution to describe all aspects of the experiment. Now going backwards, one may consider our high-resolution grid to be represented by a total of Γ different equal-sized subgrids, each referenced with index γ (e.g., ten grids of 10 × 10 resolution), such that merging all grid points of the subgrids produces the high-resolution grid. For each subgrid, one can solve the distribution with the normal matrix method, and it is easy to show that the relevant matrix equations are where are square submatrices of P of size (K/Γ) × (K/Γ) and are subvectors of of size 1 × (K/Γ). One can use a nomenclature for the elements of the high-resolution grid such that the points are ordered in sequence of the low-resolution subgrids.
In general, it is not true that the individual results from the individual problems can be combined to a concatenated vector that would represent the result of the full solution (with or without nonnegativity). This would require the cross-correlation between points from the different grids to vanish, and the high-resolution K × K matrix P to have a structure
This is not the case, as illustrated in Fig. 4 for our example data. As can be discerned clearly, the structure of P when sorted along subgrids (Fig. 4b) is different from merging the submatrices P(γ) (Fig. 4c), which neglects very substantial features of the model. If we ignore this problem and calculate the distribution with the matrix of Fig. Fig.4c4c (or, equivalently, if we simply merge all solutions from consecutively fitting the distribution data with all ten 10 × 10 grids and plot them at their appropriate points in the high-resolution grid), we arrive at a size-and-shape distribution as shown in Fig. 5b. This is very different from the known exact solution shown in Fig. 1, which is reproduced for convenience in Fig. 5a.
Apparently to address this problem, the 2DSA method takes from the concatenated solution of the subgrids only the pattern of active/inactive nonnegativity constraints. Demeler and colleagues construct from the points with nonzero concentration values in the concatenated partial solutions (or a subset thereof) a new grid, conceived to be equal in size to the original low-resolution grids, but now with uneven spacing of the grid points.
Again, we can analyze this approach best by comparison with the full, high-resolution grid with the full matrix P, where the unambiguous best-fit nonnegative solution is found exactly with the proven NNLS algorithm (Lawson and Hanson 1974). The ad hoc exclusion of grid points that did not produce positive concentration values in any of the subgrids is in direct conflict with NNLS. Nothing guarantees that the (s, fr)-values populated in the exact solution will be correctly recognized as populated species (be assigned nonzero values) in the fit with the low-resolution grid of which they are a part. The points populated in the exact solution may therefore simply not be part of the second-stage grid.
Illustrating this problem, the crosses in Fig. 5d represent all the grid points that made positive contributions in any of the preliminary sequence of low-resolution fits (which covers all grid points of the high-resolution grid, as described above). All the dots (red and blue) are the positive solution components of the exact high-resolution solution. They are colored blue if they coincide with a cross, i.e., have been correctly identified in the first stage as being part of the solution, and they are colored red if they were never part of any low-resolution fit and were therefore excluded from the second-stage grid. If the analysis proceeds with the second-stage grid (i.e., preconstraining the analysis to the values indicated by crosses in Fig. 5d), we arrive at the solution shown in Fig. 5c. This is very different from the true high-resolution solution shown in Fig. 5a. Thus, the second stage cannot correct for the errors that occur from a naïve subdivision of grids in (I).
Brookes mentions multiple stages of the sequence (I) and (II) with different mesh sizes Δ2s and Δ2fr, and an “iterative refinement” of the procedure that utilizes in stage (I) the coarse starting grids that have been extended with populated points from the results of stage (II) of the previous iteration (Brookes et al. 2009). The same fundamental concerns apply to this iteration. To the extent that the results from (II) may not contain the grid points of the exact solution, it is unclear how the inclusion of these additional grid points would aid in the recognition of the correct solution. Even if the added grid points in (I) do represent part of the correct solution, it is not certain that they would be correctly maintained as part of the solution after (II). Empirically, the Demeler laboratory reports convergence of this iteration series in the absence, but not in the presence, of systematic noise corrections to the data (Brookes et al. 2006). Even if the iteration does converge, it is unclear whether it is convergent to the correct solution.
Since the 2DSA algorithm never actually applies a model with a full regularly spaced high-resolution grid, the traditional regularization methods, such as Tikhonov or maximum-entropy regularization described above, do not seem to be easily applicable. In fact, Brookes et al. (2009) express the view that the fit of c(s, fr) with a high-resolution grid in conjunction with regularization suffers from “lack of resolution,” and “produces unnecessarily broad molecular weight distributions.” We believe that, if prior knowledge about the sharpness of the expected c(s, fr) peaks is available, this can be inserted with a Bayesian refinement of the Tikhonov regularization as we have reported for SV analysis (Brown et al. 2007) and implemented in SEDFIT.
In the absence of such prior knowledge, however, the resolution of the regularized solution is limited not by the analysis (assuming reasonable discretization), but rather by the information content of the experiment. It is important to recognize the nature of this limit, in order not to overinterpret the data. Of course, it also would be trivial, although usually misguided, to perform a distribution analysis simply not applying this regularization step at all, and to rely on the exact solution of the fit with the high-resolution model, which usually produces artifactual detail that is the result of noise amplification due to the ill-conditioned nature of the basic Eq. (3).
In our example, these aspects can be discerned when comparing the most parsimonious solution in Fig. 1d from Tikhonov regularization with the spiky exact solution in Fig. 1c, or with the incorrect solution in Fig. 5c from one iteration adapted from the Demeler scheme. Even though the spiky solutions suggest very few and discrete species to be in solution, the smooth Tikhonov solution fits the data indistinguishably well from the exact best-fit solution. Its nearly featureless appearance in the fr-dimension highlights simply the lack of sufficient information in the raw data in order to determine the fr-values well.
In order to address the impact of noise and error amplification on the results of the 2DSA algorithm, it was combined by Brookes et al. (2009) with a Monte Carlo analysis. Fifty iterations were performed by the Demeler laboratory in order to determine 95% confidence intervals. This seems to be an unusually low number of iterations, in particular since the high confidence limits require estimating the quantiles of rare events, in this case the 2.5 and 97.5 percentiles. With 50 iterations, they are determined by the extreme 1.25 occurrences of parameter values, which makes these estimates of the confidence intervals quite variable statistical quantities themselves. As is well known, usually the number of Monte Carlo iterations required to produce meaningful results is typically on the order of 1,000–10,000. However, it seems this would lead to excessive computational effort, several orders of magnitude more costly than the Tikhonov regularization in the standard approach with the full high-resolution grid, which requires for our standard example only a few seconds on our PC.
The authors report confidence intervals for molecular parameters of the identified solutes, but it is not clear whether these were determined (1) by statistical analysis of the results obtained in each Monte Carlo iteration after the integration of putative solute peaks, or (2) if these confidence intervals reflect the uncertainties of the putative solute peaks in a distribution that, as a whole, has gained error bars at each grid point from the statistics of the Monte Carlo iterations. For method (1), the problem arises of how to identify the group peaks representing a putative solute species. For method (2), the question arises of whether the Monte Carlo approach is effective in providing parsimonious “average” distributions.
Generally, Monte Carlo simulations are not part of the diverse set of regularization methods explored in the standard literature (Louis 1989; Hansen 1992, 1998; Kress 1999; Engl et al. 2000), although Monte Carlo methods have been used for estimating the regularization parameters of standard regularization functionals (Ramani et al. 2008). The concept of a statistical distribution of parameter values should be confused neither with the real population distribution of coexisting species in the sample mixture nor the estimate of the latter in the form of a calculated size-and-shape distribution. Nevertheless, one could ask to what extent one can rely on the statistical nature of the noise in the data, in combination with noise amplification, to produce a parsimonious two-dimensional histogram of (s, fr)-species populations. As an example, we compare in Fig. 6 the standard analysis of our model data with the 10 × 100 grid and Tikhonov regularization ((6a)6a) with the histogram of all distributions from 50 Monte Carlo iterations (each based on an exact standard fit with the 10 × 100 grid; grid;6b).6b). After 50 iterations the histogram clearly shows multimodal and spiky behavior suggesting the presence of multiple species, in contrast to the single broad peak representing the smoothest solution of all that fit the originally measured data. Thus, the 50 Monte Carlo iterations do not provide an effective means to correctly identify the information content of the data. If, on the other hand, we are independently knowledgeable about the monodisperse nature of the sample, we can use the Bayesian approach (Brown et al. 2007) to calculate the size-and-shape distribution that is closest to a single peak, and these results are shown, for comparison, in Fig. 6c.
In the present letter, we have examined the different algorithmic elements that were conceived and applied in the recently suggested “2DSA” size-and-shape distribution by Brookes et al. (2009). We have compared this with the standard approach that is well established for solving ill-posed integral equations problems in many fields, which rests on well-established linear algebra and related numerical tools of linear least-squares analysis. Contrary to the assertion of Brookes, Cao, and Demeler, the application of the standard approach to the size-and-shape distribution problem is quite feasible on ordinary laboratory computers within only a few minutes of computation time, even when using high-resolution grids suitable to fully extract the experimental information content. As implemented in SEDFIT, this approach is being applied in many laboratories (Markossian et al. 2006; Chang et al. 2007; Deng et al. 2007; Race et al. 2007; Broomell et al. 2008; Brown et al. 2008; Chebotareva et al. 2008; Iseli et al. 2008; Moncrieffe et al. 2008; Paz et al. 2008; Sivakolundu et al. 2008; Wang et al. 2008; Eronina et al. 2009; Mortuza et al. 2009). Since this can supply exact (up to numerical precision) best-fit solutions, we have applied it to a data analysis example to serve as a reference solution in a study of the performance of different computational strategies on which 2DSA relies. This illustrates the consequences of the deviations from the established mathematical reference frame that should be expected to arise in Demeler’s 2DSA approach.
First, there are concerns about the accuracy of the evaluated Lamm equation solutions serving as kernel to the size-and-shape distribution integral. This could likely be addressed by deviating from the discretization parameters advocated by Cao and Demeler (2005).
Second, a more fundamental problem is the use of grids with extremely small number of points, far below the resolution required to describe the data. If, as illustrated in Fig. 3, the predicted concentration profiles from these coarse models do not even qualitatively follow the experimental data, we question whether there are any meaningful conclusions that can be drawn from these results. Brookes et al. (2009) distract from this problem by incorrectly stating that such grids were the basis of the implementation of c(s, fr) models in SEDFIT, which is well described in the literature to achieve excellent fits of the data to within their statistical noise. To the best of our knowledge the attempt to utilize coarse grids is without precedent prior to the Brookes et al. paper.
Despite the inability of these grids to describe the data, Demeler and colleagues suggest that the combination of results from the application of a large number of different, but similarly coarse, grids (all with 10 × 10 or lower resolution; Demeler et al. 2009) can be used in some way to achieve an analysis equivalent to that of a high-resolution grid. In the simplest form, this argument would be incompatible with basic matrix algebra, because it neglects cross-correlation between points from different grids. Discarding the magnitude of species’ populations in this concatenated distribution, and using only the pattern of nonnegativity constraints from such an analysis, is in conflict with the established Lawson and Hanson algorithm NNLS. The effect of the empirical extension to multiple stages is uncertain, and may not converge. Although one could construe cases where it will certainly work (such as distributions consisting of a single species), the Demeler scheme for generating nonequidistant small grids in multiple stages appears fundamentally flawed for the general case.
The strategy of sequentially applying different, equally coarse, grids is in contrast to established multigrid methods for integral equations, which provide successfully finer parameter discretization (Kress 1999). The division of the full problem into separate subproblems to be solved in parallel, followed by merging their partial solutions, has been used successfully in some image restoration problems (Bevilacqua and Piccolomini 2000) where the image regions are known to be uncorrelated with each other due to a localized point-spread function. However, this condition is not fulfilled in the present case. In SV analysis, the cross-correlation of signals from different species can be very large. This is reflected by the fact that (1) is ill posed, and illustrated by the fact that the matrices in Fig. 4b and c are different. For a correct solution of the SV problem, the regular high-resolution grid should be considered fully and unbiased by any scheme of preselection of excluded parameter regions. The latter is quite feasible with standard algorithms and commonly available computational resources, and we note that the problem is fairly small compared with many image analysis problems of similar structure.
Finally, the application of the Monte Carlo approach to achieve greater parsimony of the results (i.e., simplicity of the distribution in the sense of suppressing artificial detail) is equally novel, but not very successful when we applied this idea to our example data analysis. An example of the lack of regularization in the 2DSA method resulting in artificial detail can be found in the data shown by Planken et al. (2008), where a standard c(s) analysis of SV data with maximum-entropy regularization exhibits only a single broad skewed distribution [Fig. 3c in Planken et al. (2008)], consistent with the expected continuous size distribution of the material studied, yet the 2DSA analysis of the same data suggests the presence of more than 14 discrete peaks (at different s-values and all at similar frictional ratio) [Fig. 4 in Planken et al. (2008)]. The Monte Carlo approach is certainly an extremely computationally costly step, in particular if one would carry it out with statistically meaningful iteration numbers. In contrast, application of the standard Tikhonov regularization to the full high-resolution problem, with or without Bayesian modulation, takes a small fraction of the computational effort of the original problem, i.e., on the order of seconds on a PC.
In conclusion, we would regard the computational effort to be a secondary problem, and the choice of computational platform rather inconsequential, relative to the main concern arising from simple mathematical arguments that Demeler’s algorithm may not give correct results. The authors do qualify their algorithm to be empirical, and that “the results are not generally in exact correspondence with the original problem” (Brookes et al. 2006). They argue that “[the results] can be made sufficiently close through careful use of the given heuristics” (Brookes et al. 2006). We are uncertain of the process referred to here, and how closeness to the exact solution would possibly be assessed without explicitly calculating the exact best-fit solution. So far Demeler and colleagues have not brought forward any proof that the distributions returned by the 2DSA method are at least close in the major attributes to the correct solution. We believe that the question of correctness of the algorithm is critical, especially since the authors invite the general application of this method, as implemented in the ULTRASCAN software, to address data analysis problems in novel biophysical and biochemical studies, rather than simple model problems with known solutions.
This work was supported by the Intramural Research Program of the National Institute of Bioimaging and Bioengineering, NIH.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
1We also analyzed the data with ULTRASCAN II version 9.9 to confirm our results as far as possible. Unfortunately, the current lack of a manual section for the use of the 2DSA analysis and the excessive computational times involved prevented us from a direct comparative analysis of the same data with the full 2DSA model as described by Brookes et al. (2009). Further, a detailed comparison does not seem possible due to seemingly unavoidable data truncation steps when loading data in ULTRASCAN II, and due to our inability to write the entire calculated distribution into a text file.
2They cannot, however, be calculated in a first analysis and then be subtracted from the experimental data, as described by Demeler and colleagues (Brookes et al. 2009). Since systematic noise components are part of the model, and since their estimates can correlate with the description of the macromolecular sedimentation distribution, they need to be simultaneously optimized (Schuck and Demeler 1999; Dam and Schuck 2004).
3Scaling this to the processor clock rate of a G1 smartphone, we expect this calculation should take less than 1 h, which would still compare well with the experimental time of several hours.
4As described, for example, by Demeler et al. (2009): “Typically, we apply 100–300 grid movings of a 10 × 10 grid to obtain a resolution that is commensurate with the resolution of the analytical ultracentrifuge.” and “Solutes with positive amplitudes from different grids are then unioned with each other to form new grids with a maximum number of solutes equivalent to that of a single initial grid (generally less than 100 solutes).”