The models of coupled equilibria
A symbolic representation of the pictorial two-state equilibria in is shown in . Following the nomenclature of the MWC model (Monod et al. 1963
), the two states of the equilibrium intrinsic to the apo protein () or the biochemically isolated DH domain () are referred to as the T and R states, corresponding to the square and circle, respectively. The fractional populations for states T and R are
sum to one.
are two first-order microscopic rate constants.
A symbolic representation of the pictorial four-state equilibria from is shown in . Ligand binding to states T and R gives states TB and RB, respectively, which can also interconvert. These interactions produce a closed (as distinct from linear) four-state equilibrium (). The model assumes that microscopically the T–R transition and effector binding happen instantaneously and that diagonal transitions involving simultaneous changes in both processes are too rare to be observed (Woessner 1961
). The fractional populations for states T, TB, RB, and R are pT
, and pR
, respectively. pT
, and pR
sum to one. For the unimolecular systems (), k1
are eight first-order microscopic rate constants. For the bimolecular systems (), k3
contain a term of free effector concentration (Palmer et al. 2001
Design of a novel data analysis strategy
The first step in curve fitting is to establish a real-valued target function (TF) based on quantitative relationships between the data and the adjustable variables (x
). In the absence of noise, the true values of the adjustable variables (X
) is the global minimizer, a point x
** such that
for all x
. Random noise modulates the TF such that X
is no longer the global minimizer. More often, X
is not even a local minimizer, a point x
* such that
for all x
*. In the presence of random noise, the best one can expect from curve fitting is to find the local minimizer that is the closest to X
and use it to approximate X
. Therefore, in the second step of curve fitting, TFs are minimized to obtain as many local minimizers as possible using optimization algorithms. The next step is to determine which one, if any, of the local minimizers obtained is the solution—i.e., the minimizer closest to X
. In the absence of other orthogonal information, it is often assumed that the global minimizer is the solution. Whether this assumption holds depends on the roughness of the target function surface (TFS), which is dictated by the quality and quantity of the data available and by the complexity of the models. When the function is complex, there may be many local minima of comparable TFV, and the global minimizer may or may not be that closest to X
. Since there is no a priori information about X
, it is not justifiable to attribute the global minimizer, if obtainable, to the solution in scenarios where the noise-modulated TFS is rough. This is likely the case in solving the four-state equilibrium. Before attempting to solve such problems, it is necessary to probe the roughness of TFS to assess the feasibility of the fitting.
Minima mapping from grids of reduced dimensions: evaluating the feasibility of solving a four-state system
The four-state model is complicated [(1
) and (2
)] and the number of adjustable parameters used for fitting relaxation dispersion data is large [(3
) and (4
)]. In any parameter set, there are 4 kinetic, 3 population, 3n
chemical shift (3 for each spin, the fourth is set at zero since the chemical shift differences dictate relaxation dispersion), and 2n
variables, where n
is the number of resonances in fast-to-intermediate exchange for which CPMG data of high quality can be obtained. The dimensionality of the system (d
) increases with n
. In our benchmark n
= 8, corresponding to a total of 24 chemical shift, and 16
variables (). Since it is not practical to perform unconstrained probing of the roughness of the TFS for systems of such high dimensionality (d
= 47), we devised a strategy to reduce the dimensionality of the search by using NMR information that is amenable to experimental perturbation.
Grid search is a commonly used method to probe and evaluate the TFS (Thisted 1988
; Wen and Hsiao 2007
). This approach involves setting up grids of all adjustable variables and evaluating the TFV at each grid point. Even if each parameter were to take only two different values, with the data here there are c.a. 247
) grid points for the TFV calculation. Moreover, there are no intuitive ways to analyze the calculated TFV due to the high dimensionality. We designed a data analysis protocol to overcome these difficulties. First, the grid search dimension was dramatically reduced by fixing the initial values for most of the adjustable parameters and only varying the initial values of a few key adjustable parameters. Since
variables are orthogonal to all other variables, it is easy to fit them (Ishima and Torchia 2005
), and each of them was simply assigned with a reasonable initial guess. In many four-state systems, one two-state edge can be isolated by removing domains or by making point mutations that strongly skew one of the two equilibrium processes. Alternatively, both equilibria can be modulated together, generating new variants of the four-state system with different collections of rates and populations (Li et al. 2008
). Often, chemical shift and relaxation dispersion analyses of these modified systems will provide qualitative estimates of the chemical shifts of each spin in the different states (Li et al. 2008
; Yu et al. 2010
). In our analyses here, initial values of all 24 chemical shifts were based on our chemical shift perturbation and relaxation dispersion analyses of a battery of mutant helix-DH proteins (Li et al. 2008
). Initial values of all four kex
s were set to be identical and varied between e5.0
, corresponding to slow and intermediate/fast exchange regimes, respectively. Initial values of pT
were given based on chemical shift (Li et al. 2008
) and relaxation dispersion analyses of the helix-DH protein and its variants, and initial values for pTB
were varied (see “Materials and Methods” for details). These treatments reduce the dimensionality of the grid search to 2,1
one along pTB
and the other along the kex
s. We note that all adjustable variables were, and must be, allowed to change during the actual optimization process; only the initial values of the adjustable variables at the start of optimization are fixed here.
A second modification was that instead of calculating the TFV at each grid point as done in standard grid search algorithms (Thisted 1988
; Wen and Hsiao 2007
), a minimum on the TFS was obtained from each grid point via the Levenberg–Marquardt nonlinear optimization algorithm. Theoretically, all minima of a TFS can be mapped if the optimization procedure is carried out from a sufficient number of well-designed grid points. If the number of distinct minima is high, it is not feasible to solve the four-state equilibrium unless further information is incorporated. This curve fitting strategy, modified from standard grid-search algorithms, is referred to as target function minima mapping from grids of reduced dimensions (M2GRED).
Logarithm of average normalized distance
To sufficiently map minima on the TFS, M2GRED analysis will generate a large number of minimizers: some of them are identical and others are distinct from one another. Relationships among them need to be evaluated to assess minima map of the TFS. However, these minimizers are often multi-dimensional vectors, in which there are different types of variables with utterly different absolute values. Therefore, it is usually not straightforward to assess the relationship between any pair of minimizers unless they are identical. To effectively evaluate the TFS, a quantity termed logarithm of average normalized distance (LAND) was designed to assess similarities among the multidimensional minimizers:
in which M1 and M2 are two minimizers being compared, i
indices are the adjustable variables in the n
dimensional vectors. The LAND is formulated based on the Euclidean distance between two n-dimensional vectors, M1 and M2, with two levels of normalization. For each element pair (M1(i
)), their squared deviation, (M1(i
, is divided by a normalization factor, (|M1(i
)| + |M2(i
, which is always equal or larger than (M1(i
. The normalization enables co-existence of variables of different amplitudes in the vectors being compared. The sum of the normalized squared deviations is divided by the dimensionality of the vectors, n
. This normalization enables comparison of vectors of different dimensionalities. The logarithm calculation is used to improve the sensitivity of the measure to smaller differences. For m
minimizers under investigation, pair-wise LAND calculation results in an m
symmetric matrix. The more similar two minimizers are, the smaller the LAND is. To avoid taking the logarithm of zero on the diagonal, all average normalized distances below or equal to 10−1.5
are set to 10−1.5
, corresponding to a LAND value of −1.5. The choice of this cut-off is arbitrary. Nevertheless, a LAND value of −1.5 roughly corresponds to average 6.3% (= 2 × 10−1.5
) deviation per adjustable parameter. Such deviation is smaller than experimental deviation levels in many biochemical and biophysical applications. We note that LAND is designed to evaluate similarities between minimizers and does not take into account correlations between adjustable variables.
M2GRED/LAND analysis guide curve fitting of relaxation dispersion data
The availability of the novel data analysis strategy allows effective evaluation of the smoothness of the TFS when different amounts of orthogonal information are incorporated in curve fitting. M2GRED analysis was first performed on the benchmark relaxation dispersion data for all adjustable variables of the four-state model and 210 minimizers were obtained. Their TFVs range from 284.8 to 357.5 (). Most of the minima possess distinct TFVs, indicating that the TFS possesses numerous local minima. A LAND matrix of a 7-dimensional vector containing 4 kex
and 3 population values is plotted in , and a LAND matrix of a 24-dimensional vector containing 24 chemical shifts is shown in . LAND values decrease from red to blue (see color bars in ). A blue cluster in the matrix indicates that the minimizers in it are similar to one another. As shown in , there are very few blue clusters in either matrix, confirming that most of the grid points lead to distinct minima—i.e., the TFS is very rough. Since previous work on three-state systems showed that simple data were not sufficient to define parameters (Korzhnev et al. 2004
; Neudecker et al. 2006
), this level of roughness is expected after fitting the benchmark relaxation dispersion data to all adjustable parameters of the four-state model.
Fig. 3 Curve fitting results of benchmark CPMG data. Fitting of the benchmark CPMG data to all adjustable variables of the four-state model (a–c), and all adjustable variables except those on the T-R edge (d–g). a TFV values of the 210 minimizers. (more ...)
Clearly, orthogonal information is required to solve the benchmark four-state equilibrium. In many allosteric four-state systems, at least one edge of the four can be isolated through biochemical manipulations, to reduce the system to two states (). For example, in helix-DH, the intrinsic dynamics in the DH domain can be isolated through analysis of the isolated DH protein lacking the inhibitory helix (Li et al., in preparation). This then enables the adjustable parameters of the isolatable edge to be determined independently and used as constraints in the four-state curve fitting. To mimic this in the simulations here, the kinetic parameter (kex1), the population ratio (pT/pR), and the chemical shift differences (|ΩT – ΩR|) on the T-R edge were fixed in further curve fitting. As a result, the numbers of kinetic, population, and chemical shift adjustable variables are reduced from 4, 3, and 24 to 3 (kex2, kex3, kex4), 2 (pTB, pRB), and 16, respectively.
M2GRED analysis was then carried out on the bench-mark CPMG data set, and 240 minimizers were obtained for LAND analysis. TFVs of the minimizers are shown in . The LAND matrix of a 5-dimensional kinetic/population vector and that of a 16-dimensional chemical shift vector are shown in , respectively. In the former matrix, there is one major cluster of low TFV (blue; minimizers 1–150 in ), meaning that the majority of the minimizers have similar fitted kinetic and thermodynamic variables. In the chemical shift LAND plot, the low TFV cluster (minimizers 1–150) diverges into a large one and some minor ones (), suggesting that chemical shifts suffer from greater degeneracy compared with kinetic and thermodynamic parameters.
To assess the relationship between the minimizers and the solution—i.e., the one closest to the true solution—the LAND values between the 21-dimentional vector (containing 3 kinetic, 2 thermodynamic, 16 chemical shift variables) of all 240 minimizers and that of the input vector were calculated (). Minimizers in the large blue cluster on the chemical shift plot (minimizers 21–137) are the closest to the input vector (the basin on the curve in ), indicating that it is, by definition, the solution.
Solution determination using chemical shift constraints
It is important to note that in the chemical shift LAND matrix there are 20 minimizers that have TFVs (between 294.9 and 295.6) slightly lower than that of the solution (295.9) (). Thus, the TFV alone is still insufficient to determine the complete solution for the four-state equilibria, even with constraints on adjustable variables on one edge. Since foreknowledge of the solution would be unavailable in experimental applications, further information is needed to pinpoint the solution. To this end, independent chemical shift information is available from HSQC measurements for each spin in both the two-state systems and the four-state systems. In principle, this information could be used as constraints in curve fitting. But this would require knowledge of the relative sign of the chemical shifts of a spin in all four states. While the relative sign can be determined for two-state systems (Skrynnikov et al. 2002
), it is challenging to determine for four-state systems. Therefore, we use the chemical shift information to identify the solution from the resulting minimizers instead. For each spin, one can independently determine the difference between its chemical shift (CSD) in the two-state system and in the four-state system. If a minimizer is the solution, its fitted population and chemical shifts should recapitulate the experimentally measured CSD. To utilize this information, the CSD was calculated for each spin using the input populations and chemical shifts according to (6
to mimic data available from HSQC spectra (CSDHSQC
). CSD was also calculated for each spin using the fitted populations and chemical shift variables in all minimizers based on the same equation (CSDCPMG
If a minimizer approximates the solution closely, the CSDCPMG value for each spin determined from it will predict CSDHSQC well—i.e., both the slope and the correlation coefficient (r2) of the linear correlation for all spins in a minimizer will approach unity. We determined the CSDCPMG and CSDHSQC values for the eight spins in each of the 240 minimizers and calculated the slopes and r2 of the resulting 240 linear correlations (). A total of 117 minimizers have r2 values greater than 0.94 (vertical dotted line) (0.947–0.952) and slopes close to unity (horizontal dotted line) (0.945) (inset of ). The 117 minimizers coincide exactly with the large blue cluster of the chemical shift plot—i.e., the solution (). The plot of CSDHSQC versus CSDCPMG of the solution is shown in . There are also 20 minimizers that have TFVs slightly lower than that of the solution (). The first four of these minimizers have either r2 less than 0.8 or slope less than 0.6 and are therefore out of the axis limits of . Minimizers 5–20 have r2 between 0.857 and 0.866, and slope between 0.908 and 0.918 (red dotted circle in ), and are clearly distinguishable from the cluster corresponding to the solution. These results collectively suggest that CSD analysis in conjunction with the TFV is sufficient to identify the solution from all minimizers obtained for the benchmark parameter set.
Fig. 4 Determining the solution using chemical shift differences. a Plot of the slopes and the correlation coefficients (r2) of the linear correlations of chemical shift perturbations calculated from the minimizers (CSDCPMG) with chemical shift perturbations (more ...)
The kinetic and thermodynamic parameters in the solution are shown together with their input values in . Among the five adjustable kinetic and thermodynamic variables, kex3 and pTB are accurately (close to their input values) and precisely (small uncertainties) determined; kex2 is determined with less accuracy, while kex4 and pRB are determined with less precision. Monte Carlo (MC) simulation was used to assess which adjustable parameters can be faithfully fitted and which cannot in the presence of random noise. In each simulation, 4% random noise was incorporated into the noise-free benchmark data set. Ideally M2GRED/LAND analysis should be employed to find the solution for each data set. However, this is extremely time-consuming for a large number of simulations. Instead, the minimizer obtained from the input parameter vector as the initial guess was assumed to be the solution. A total of 1,000 MC simulations were carried out. The means and standard deviations of fitted kinetic/thermodynamic variables are shown in . Among them, kex3 and pTB are accurately and precisely determined as they contribute significantly to relaxation dispersion, kex2 and pRB are reasonably defined, and kex4 is poorly determined presumably due to its marginal contribution to relaxation owing to the low-populated R-RB edge (pR = 0.03, pRB = 0.05). Nevertheless, the MC simulation results are statistically indistinguishable from the solution ().
In summary, when combined with relaxation dispersion data acquired at two fields, CSD information and constraints of variables on one edge are sufficient to determine the solution of the benchmark four-state equilibrium in the presence of experimentally reasonable noise. We note that chemical shift analysis for slow exchange scenarios can help identify solutions for chemical shifts but not for population values since chemical shifts do not provide thermodynamic information in this exchange regime. This information can in principle be extracted from relative peak volumes. However, due to line broadening and/or population bias, it is often difficult to rely on these measurements for extraction of thermodynamic information. Further analyses are necessary to assess the value of independent chemical shift information in identifying the solution of a four-state equilibrium in slow exchange.
Improvement in fitted parameters by additional measurements
It is instructive to assess how different types of additional data affect the quality of the fitting, as a guide to experimental design. We compared the benefits of repeating the measurements at 600 and 800 MHz to those of adding 900 MHz data to the benchmark. In the former case, the adjustable parameters remain unchanged. In the latter, there are 8 additional R20 values due to the introduction of a new field. The two new data sets were subjected to 1,000 MC simulations at a 4% noise level. In both cases the incorporation of additional data modestly improved both the accuracy and the precision of fitted kinetic and thermodynamic variables (data not shown). The two already well-determined kinetic parameters, kex2 and kex3, show the largest relative improvement. For example, the standard deviation of kex2 decreases more than 60% when 900 MHz data are incorporated. The two most poorly determined parameters, kex4 and pRB, only show slight improvement in their precision upon both treatments. Overall, incorporation of 900-MHz data is somewhat superior to repeating measurements at the two lower fields since this information gives larger improvements of precision in 4 out of 5 kinetic/thermodynamic variables.
MC simulation on all parameter sets
In order to more comprehensively survey the parameter space in which it is feasible to solve a four-state equilibrium, we tested the fitting for all 624 parameter sets described in “The parameter sets” using MC simulation analysis. For each parameter set, we added 4% noise to synthetic 600- and 800-MHz CPMG data and carried out one hundred MC simulations. In each case, kex1 and the pT/pR ratio were fixed; fitting was initiated with the starting parameter set and the resulting minimizer is assumed to be the solution based on previous observation (data not shown). The means and standard deviations of the normalized fitted kinetic and thermodynamic variables (kex2, kex4, kex3, pTB, and pRB) are shown in Figs. and . Using normalized values one can readily assess the accuracy (deviation of the normalized mean from 1) and precision (size of the normalized standard deviation) of the fitting across a large range of absolute values. For each adjustable variable, we plotted fitted values from all 624 parameter sets sequentially organized as described as follows: within each 13-column colored sector, all the input values remain constant except for input pR (pT) values which decrease (increase) from left to right, following the order in ; different colors represent different input pTB (pRB) values, which decrease (increase) from red to blue sectors as in ; each of the tri-colored (red, green, blue) supersectors are each composed of a total of 39 columns, and correspond to a unique set of input kex1, kex2, kex3, and kex4 values and are sequentially organized following the order (from top to bottom) in .
Fig. 5 Fitted kinetic and thermodynamic variables. Fitted kex2 (a), kex4 (b), kex3 (c), pTB (d), and pRB (e) are shown for all 624 parameter sets. Each parameter is normalized to its input value. The error bars are the SD of the normalized fitted parameters. (more ...)
Fig. 6 Fitted kinetic and thermodynamic parameters. For clarity, fitted kex2 (a), kex4 (b), kex3 (c), pTB (d), and pRB (e) of the first 39 parameter sets in are shown. The three wedges above a show ascending order of the ratios between pT to pR in each (more ...)
In curve fitting, a variable can only be well defined if it contributes significantly to the measurements. As shown in Figs. and , kex2 is reasonably determined in the later part of each colored sector, corresponding to conditions where pT is higher, and the T-TB edge thus makes a larger contributions to relaxation. In contrast, Figs. and show that kex4 is reasonably determined in the early part of each of the colored sectors. In these parameter sets, pR is higher, resulting in larger contributions to relaxation from the R-RB edge. Owing to the opposite responses to alterations in pT/pR, there is a negative correlation between the certainty of kex2 and kex4 (compare Figs. , ). Both the accuracy and precision of the fitted kex3 are reduced when (pTB, pRB) = (0.88, 0.02) and pT is high as shown in 9th, 11th, 13th, and 15th red sectors (parameter sets 313–325, 391–403, 469–481, 547–559, respectively) in . This is because, as pRB becomes small and pT becomes high, the T-TB edge replaces the TB-RB edge as the relaxation-dominating edge. kex3 is very well deter-mined when the TB-RB edge is less skewed—i.e., (pTB, pRB) = (0.85, 0.05) or (0.8, 0.1) (green and blue sectors in Figs. and )—as the edge dominates relaxation in these cases. In principle, an edge can become relaxation-dominating if its kex value switches from the fast to intermediate exchange regime. However, there are no such examples in the current parameter sets.
As far as adjustable population variables are concerned, pTB is very well determined since it is the major species and always on edges that contribute to relaxation significantly (Figs. and ); pRB is better determined if the TB-RB edge is less skewed [blue sectors of , (pTB, pRB) = (0.8, 0.1)], and it tends to be over-estimated if the TB-RB edge is severely skewed [red sectors of , (pTB, pRB) = (0.88, 0.02)]. Therefore, the ratio of pTB/pRB tends to be accurately estimated when the TB-RB edge is less skewed, and underestimated when the TB-RB edge is severely skewed.
In principle there are two ways to calculate the coupling between the equilibria in the four-state box: (pT
) or (pT
). However, since the ratio, pT
, is determined in our approach through a biochemically-created two-state system, it can be quantified most accurately. This speaks to calculating coupling through the latter measure. From the analysis in the preceding paragraph, the ratio (pT
) can be most accurately determined when the TB-RB edge is less skewed, but will have systematic deviations if the TB-RB edge is severely skewed. Nevertheless, the systematic deviation may be especially significant for signaling proteins, in which the coupling strengths tend to be small, ranging from a several-fold to tens of folds (Hantschel et al. 2003
; Lietha et al. 2007
; Moarefi et al. 1997
; Pluskey et al. 1995
; Yohe et al. 2008
; Yu et al. 2010