|Home | About | Journals | Submit | Contact Us | Français|
A major challenge in structural biology is to determine the configuration of domains and proteins in multi-domain proteins and assemblies, respectively. To maximize the accuracy and precision of these models, all available data should be considered. Small angle x-ray scattering (SAXS) efficiently provides low-resolution experimental data about the shapes of proteins and their assemblies. Thus, we integrated SAXS profiles into our software for modeling proteins and their assemblies by satisfaction of spatial restraints. Specifically, we model the quaternary structures of multidomain proteins with structurally defined rigid domains as well as quaternary structures of binary complexes of structurally defined rigid proteins. In addition to SAXS profiles and the component structures, we employ stereochemical restraints and an atomic distance-dependent statistical potential. The scoring function is optimized by a biased Monte Carlo protocol, including quasi-Newton and simulated annealing schemes. The final prediction corresponds to the best scoring solution in the largest cluster of many independently calculated solutions. To quantify how well the quaternary structures are determined based on their SAXS profiles, we used a benchmark of 12 simulated examples as well as an experimental SAXS profile of the homo-tetramer D-xylose isomerase. Optimization of the SAXS-dependent scoring function generally results in accurate models, if sufficiently precise approximations for the constituent rigid bodies are available; otherwise, the best scoring models can have significant errors. Thus, SAXS profiles can play a useful role in the structural characterization of proteins and assemblies, if they are combined with additional data and used judiciously. Our integration of a SAXS profile into modeling by satisfaction of spatial restraints will facilitate further integration of different kinds of data for structure determination of proteins and their assemblies.
A comprehensive structural description of proteins, nucleic acids, and their assemblies will help us discover the principles that underlie cellular processes and bridge the gaps between genome sequencing, functional genomics, proteomics, and systems biology 1; 2. While X-ray crystallography and NMR spectroscopy can provide accurate high-resolution structures, these methods are limited by the difficulties in protein purification, stability of large complexes, crystallization (X-ray), and size (NMR). Single particle cryo-electron microscopy (cryo-EM) generally does not provide atomic-resolution structures and currently cannot be applied to systems smaller than approximately 250 kDa. While efficient, computational protein structure prediction methods are limited by their accuracy. These difficulties may be overcome by computational methods that effectively combine experimental, theoretical, and statistical information 2; 3.
Small angle X-ray scattering (SAXS) can rapidly provide low-resolution information about the shape of a macromolecule or a complex in solution 4; 5; 6; 7. A SAXS measurement determines the molecule's rotationally averaged scattering intensity as a function of spatial frequency, I(q), typically at 1-3 nm resolution 5; 7. This profile can be readily transformed into an electron pair distance distribution function P(r), which is essentially a histogram of all pairwise distances r of the electrons in the sample. Due to the rotational averaging, the information content of a SAXS profile is dramatically reduced compared to an X-ray crystallographic diffraction pattern or even a density map from cryo-EM. However, one of the advantages of SAXS is its applicability to a wide range of assemblies; its applications range from DNA fragments 8 to whole virions 9. Moreover, data collection and processing are very rapid (typically, from seconds to minutes), allowing high-throughput analyses of a large number of samples at many conditions. These advantages are in stark contrast to crystallography, cryo-EM, and NMR spectroscopy. The ease of altering solution conditions makes SAXS ideal for mapping structural differences between varied conformational states of a macromolecular system; if the structure of one conformational state is known, even the relatively sparse information content of a SAXS profile can be sufficient to determine structural rearrangements, such as hinge-motions in proteins 10; 11; 12.
Information from SAXS can in principle be incorporated into the modeling process in two ways. First, a SAXS profile can be used to assess different models that were produced based on other considerations. For example, experimental SAXS profiles have been used to choose one of the many different quaternary structure arrangements produced by molecular docking of the son-of-sevenless domains 13. Similarly, simulations indicated that SAXS profiles can be used to choose a close-to-native solution from a large set of alternative homology models of a given protein 14.
Second, a SAXS profile can be used during the model building stage itself. The first such calculation of a model based on a SAXS profile relied on representing a macromolecular surface using spherical harmonics 15. However, this representation has a relatively low resolution, thus leading to the development of alternative methods. Due to the limited information content of SAXS profiles, virtually all subsequently developed methods have aimed to integrate additional information into structure determination to reduce the manifold of solutions consistent with a given SAXS profile to a usefully small number.
Depending on the nature and resolution of additional information, different representations have been proposed. Early approaches modeled the molecule's envelope enforcing compactness 16. Early coarse-grained approaches represented the macromolecule as an assembly of unconnected beads on a grid 17; 18; 19. This representation enforces an overall mass by using a required number of beads; geometrical symmetry may also be incorporated through symmetric sampling. In addition, compactness of the models is ensured by restricting the sampling to the vicinity of a compact initial model 17; 18 or by including appropriate terms into the scoring function 19; 20. Other modeling approaches represent a protein as a chain of beads rather than a set of disconnected beads on a grid 20. In a recent application, atomic models were fitted into 6-fold symmetric bead reconstructions to gain insights into domain rearrangements of the AAA-ATPase p97 during its functional cycle 10. Higher resolution a priori structural information about some parts of the protein can also be integrated into the modeling process by focusing the conformational sampling only on the undefined segments, such as loops 21, or on the configuration of structurally defined domains and their flexible linkers 22. For example, rigid body modeling was applied to give qualitative insights into the conformation of the polypyrimidine binding protein 23. Recently, SAXS profiles have been incorporated into the modeling of protein structures based on NMR-derived restraints, which significantly increased the accuracy of models for multidomain proteins compared to the models based on NMR profiles alone 24. In addition, the inclusion of simulated SAXS profiles into folding simulations led to models of small helical proteins 25.
Here, we describe the newly developed SAXS module in Integrative Modeling Platform (IMP), our software platform for modeling macromolecular assemblies by satisfaction of spatial restraints (http://salilab.org/imp) 3; 26. This integration in turn allowed us to combine SAXS profiles with various other types of data already used by IMP. Specifically, we present a protocol for modeling multidomain proteins and complexes. In particular, the protocol calculates the quaternary structures of multidomain proteins with structurally defined rigid domains as well as quaternary structures for complexes of structurally defined rigid proteins. In addition to a SAXS profile and rigid-body constraints, we employ stereochemical restraints derived from a molecular mechanics force field 27, a simple non-bonded atom pair term 28, the atomistic distance-dependent statistical potential DOPE 29, and an optional symmetry-enforcing term 26. We quantify the performance of the protocol using a benchmark of simulated examples. Finally, to test the method in a realistic setting, we also model the quaternary structure of the homotetratmer D-xylose isomerase (XI) based on an experimental SAXS profile. The method already revealed large domain rearrangements between the nucleotide-free and the nucleotide-bound forms of Escherichia coli Hsp90 12.
We developed a method to calculate atomic models of proteins and their assemblies that are consistent with experimental SAXS profiles as well as other spatial restraints. The solution is found by optimizing a scoring function that quantifies how consistent a model is with the SAXS profile and the other restraints. Next, the method is described by specifying its three components: (i) the representation of the modeled system, (ii) the terms that contribute to the scoring function, and (iii) the optimization protocol. We also describe how to evaluate the ensemble of models obtained from independent optimizations of the scoring function.
The system is represented by its Nat non-hydrogen atoms. A major problem that needs to be overcome is the large size of the search space compared to the amount of input data. To reduce the number of degrees of freedom, the model is partitioned into L rigid bodies, bl, l = 1,2,…, L ≤ Nat. Each rigid body is characterized by its center of mass, and its orientation . The orientation is measured by three rotations (θx, θy, θz) around the x-, y-, and z-axes of the coordinate system with its origin at the center of mass. A rigid body can be any set of atoms, including a single atom, a secondary-structure element, a domain, or the whole protein. Each atom of the whole macromolecule is a member of exactly one rigid body. The two extreme cases correspond to either the entire structure or each atom being a rigid body. Here, in our applications to proteins, we model user-defined domains as rigid bodies while the connecting linkers are flexible. In our application to assemblies of proteins, we model the whole proteins as rigid bodies.
The scoring function is defined as:
Sstereo accounts for the inter-rigid body stereochemical features of the atoms in the protein (ie, chemical bond lengths, bond angles, etc) according to the CHARMM molecular mechanics force field 27. The term Soverlap penalizes steric clashes in the model 28; a harmonic potential penalizes distances between two atoms if they are closer than the sum of their van der Waals radii. The term SDOPE is an atomic distance-dependent statistical potential called Discrete Optimized Protein Energy (DOPE) 29. Additionally, a symmetry restraint Ssym can be imposed on the arrangement of the rigid bodies 26: Ssym is the sum of the root-mean-squares of the differences between equivalent distances in two symmetry units. The term SSAXS accounts for the SAXS profile and is described in detail next.
SAXS experiments measure the rotationally averaged X-ray scattering of the macromolecule under scrutiny. The measured quantity is a one-dimensional curve that gives the scattered intensity Iexp as a function of the momentum transfer q=(4π/λ)sin(θ), where λ is the wavelength of the incident X-ray beam and 2θ is the scattering angle.
Similar to previous approaches to modeling a macromolecule based on its SAXS profile 17; 18; 20; 24, we score a model based on the deviation between the calculated (Im) and experimental SAXS profiles (Iexp):
where k denotes the index of the measured frequency q, Q is the total number of frequencies, and σexp is the experimental error. The relative scaling of Im with respect to Iexp cannot be determined precisely because the protein concentrations generally cannot be measured with sufficient accuracy. Thus, the profile Im is scaled by a constant c, which is chosen by minimizing χ2 (Supplementary Theory and Methods).
Moreover, we require SSAXS to be comparable in size to the other four types of terms in S. We used the following rationale: Soverlap, Sstereo, SDOPE, and Ssym are extensive functions in that they increase proportionally to the number of atoms Nat in the model. For the native structure, the value of χ2 from Eq. (2) is on the order of 1. We choose the SAXS penalty of the native structure to be of comparable magnitude as the internal potential energy, defining the scaling as follows:
where kB is the Boltzmann constant and T=100 K.
The experimental SAXS profile of a macromolecule in solution is the difference between the scattering of the solution with and without the protein. Thus, the profile is approximately equal to the difference between the scattering intensities of the macromolecule and the solvent in the same volume. This model neglects the approximately 3 Å thin hydration layer around the macromolecule, which has a slightly higher density than bulk water. We neglected the hydration layer because the density difference between the hydration layer and bulk water is relatively small (<0.060 e-Å-3) compared to the densities of water (0.334 e-Å-3) and protein (approximately 0.440 e-Å-3) and because the volume of the hydration layer is small compared to that of the macromolecule. We assess the error due to neglecting the hydration layer in Results by comparing our profiles to those using the program CRYSOL, which does account for the hydration layer.
We utilize the Debye Formula 30 to calculate the SAXS profile Im of a given atomic model of a protein or assembly:
where fi(q) and fj(q) are the isotropic atomic form factors of the atoms i and j, respectively, and dij is the Euclidean distance between these atoms. In solution, the scattering of the protein is reduced by the contribution of the solvent in the protein excluded volume; we followed a previous formulation 31 to account for this effect (Supplementary Theory and Methods). To accelerate computation, we did not include hydrogen atoms explicitly; their vacuum scattering and their excluded volumes were added to those of their bound atoms.
The derivatives with respect to y and z are equivalent to those for x.
The calculation of both χ2 and its derivatives is computationally demanding. However, we can substantially shorten the computation time for both quantities by approximating the modeled system with atoms of different scattering masses but equal shape (Supplementary Theory and Methods). Using this approximation, the computation time is reduced by a factor equal to the number of frequencies at which I(q) is sampled, typically 100. Even at the resolution of qmax=1Å-1, the loss in accuracy due to this approximation is marginal: The deviation between a profile calculated with and without the approximation is of the order of χ2 =10-3, which is at least two orders of magnitude below typical optimal χ2 scores for our models (Supplementary Theory and Methods).
The rigidity constraints imply that atoms belonging to the same rigid body bl are not allowed to move with respect to each other. Thus, all forces within bl must be zero, which is achieved for SSAXS by excluding all atoms that are part of the same rigid body as atom i from the summation in Eq. (5).
In IMP, we provide the option to add more than one SAXS profile term to S (Supplementary Theory and Methods). In addition to the profile of the macromolecule under scrutiny, SAXS profiles of the gold-labeled macromolecule or profiles of subsets of the original system can be acquired. When the corresponding structures are conserved in these constructs, these profiles provide additional information about the macromolecule.
We optimize the model with respect to S to obtain an ensemble of good-scoring solutions (Fig. 1). To sample different minima of S, many independent optimizations are carried out starting from different random initial configurations. We differentiate between the global search mode, in which we sample 1000 different initial configurations, and the local search mode, in which we sample 100 initial configurations in the vicinity of a specific configuration. In both cases, the resulting models are clustered to make the final prediction.
Starting from an initial model of a monomeric protein or a complex (eg, a crystal structure or a homology model), the domains and the connecting linker residues (monomeric protein) or the proteins (complex) are individually rotated and translated by random values ΔΦ and ΔT, respectively. Depending on the magnitude of these parameters, the protocol explores the intermediate neighborhood of the initial model (local search mode) or the whole space of solutions (global search mode). We chose ΔT=40Å and ΔΦ=180° for global sampling and ΔT=1Å and ΔΦ =2° for local sampling. In the case of monomers, the domains are initially brought into relative vicinity by optimization with respect to Sstereo using the method of conjugate gradients 32, which reduces the sampling problem at a negligible computational cost.
Next, in Stage I of optimization, a coarse relaxation of the model is carried out with 10 cycles of a quasi-Newton method 33, relying on the scoring function consisting only of Sstereo, Soverlap, Ssym, and SSAXS; the weight of Soverlap is gradually increased from 0.005 to 1. The low initial weight of Soverlap, allows large variations of the model during the initial iterations.
In Stage II, we further optimize the model with respect to the sum of Sstereo, Soverlap, Ssym, and SSAXS using a biased Monte Carlo algorithm: all atoms of the model are randomly displaced with a standard deviation of 5 Å, maintaining the rigid-body constraints. This perturbed model is then relaxed using one cycle of quasi-Newton optimization. Twenty-five Monte-Carlo steps are carried out, decreasing the temperature exponentially. At each step, the resulting model is accepted or rejected according to the Metropolis criterion.
In Stage III, additional 25 Monte-Carlo steps are finally carried out to optimize the models with respect to the complete S, including SDOPE.
On average, more than 200 different models are generated for each of the 1,000 initial independent optimizations, resulting in the total of ~200,000 evaluated models. On a Linux-based cluster of 100 processors, the computations take from a few hours to two days, depending on the size of the modeled system.
Because each optimization of a random starting configuration samples only a fraction of the configuration space, it generally ends in a local minimum. Thus, we rank all solutions by SSAXS and retain only the top 10% for further analysis. These solutions are hierarchically clustered 34 with MATLAB (The Mathworks Inc.), relying on Cα root-mean-square deviations (Cα RMSDs) for all pairs of structures. Each Cα RMSD is obtained by superposing the largest domains and then computing the deviation between the whole structures. As a threshold level t for clustering, we use the resolution of the experimental SAXS profile, t=2π/qmax.
To assess the method with statistical rigor, we applied it to a benchmark set of 12 known structures with calculated SAXS profiles. The benchmark includes 9 multidomain proteins (Tables 1, S1); their domains are treated as rigid bodies in our calculations. The domain definitions for the native structures were taken from the CATH database. Domains for five of the 9 proteins were represented by experimentally determined structures with identical sequences in a different state. Domains for the other 4 proteins were modeled by comparative modeling based on related structures. The alignments for comparative modeling were obtained from our comprehensive database of structural alignments, DBAli 35. The models were built with the ‘automodel’ class in MODELLER-9.0. Except for 1o0vA, the models cover the whole sequence. The benchmark also includes 3 protein complexes, each consisting of 2 proteins (Table 2), which were obtained from ‘Docking Benchmark 2.0’ 36. Here, the rigid bodies for each protein corresponded to crystallographic structures of the same sequence in a different state 36.
We aimed to map the accuracy of the predicted configurations as a function of rigid body accuracy. We modeled the configurations of rigid bodies as described above, employing the global and local search modes. For comparison, we also created an initial model with all rigid bodies superposed onto their counterparts in the native structure (RBNC, rigid bodies in the native configuration) using Chimera 37 and optimized the linker segments with respect to Sstereo and Soverlap. Starting from this initial model, we performed 10 independent local optimizations (“RBNC refined” in Supplementary Table S1). Similarly, we also performed 10 local optimizations starting with the native state (“native refined” in Supplementary Table S1).
The models were compared to the native state using two measures: (i) Cα RMSD with respect to the native structure and (ii) rigid-body translation Δr and rotation Δα of the rigid bodies relative to their positions in the native state. The Cα RMSD was calculated using the ‘superpose’ command of MODELLER-9.0. For calculation of Δr and Δα, the reference frame was defined by first superposing the largest rigid body from the model on its equivalent fragment in the native structure. Next, each of the remaining rigid bodies was rotated around its center of mass and subsequently translated such that it superposed with the equivalent part of the native structure. The corresponding rotation and translation define Δα and Δr, respectively. If the model consisted of more than two rigid bodies, we computed the mean values of Δα and Δr to characterize the similarity between two configurations of rigid bodies, always using the largest rigid body to define the reference frame.
We prepared XI solutions with concentrations of 0.55, 1.1, 2.7, and approximately 20 mg/ml from XI crystals (Hampton Research, Aliso Viejo, CA) and a buffer solution containing 10mM Hepes (pH 7.4) and 150 mM NaCl. SAXS profiles were recorded at Beam Line 4-2 at the Stanford Synchrotron Radiation Laboratory 38: Each solution was placed in a cuvette, which was maintained at 20 °C and located 2.5 m from a MarCCD165 detector (MarUSA, Evanston, IL). Twenty 15 second exposures (X-ray wavelength λ=1.381 Å) were acquired in series for each concentration. For a range of concentrations, we obtained approximately constant radii of gyration RG in the RG·qmin-RG·qmax range of 0.37-1.27, indicating no protein aggregation or changes in the homotetramer quaternary structure: RG of 32.7±0.4 Å-1 at 0.55 mg/ml, 32.9±0.2 Å-1 at 1.1 mg/ml, and 32.8±0.1 Å-1 at 2.7 mg/ml; these RG values are similar to those reported previously39. SAXS profiles were computed from the scattering images using MarParse 38 and profiles recorded at 2.7 mg/ml (0.01 Å-1 < q < 0.10 Å-1) and 20 mg/ml (0.055 Å-1 < q < 0.27 Å-1) concentrations were scaled and merged using Primus 40 The complete profiles ranged from q=0.01 Å-1 to q=0.27 Å-1. We finally resampled the profile on a uniform grid of 100 mesh points using linear interpolation.
We determined that it is not necessary to include the hydration shell of a protein in the Debye model for calculating SAXS profiles because the inclusion of the hydration shell has a much smaller impact on χ2 than the errors in an experimentally measured SAXS profile. Specifically, we compared the experimentally measured lysozyme SAXS profile with the profiles computed by our method and CRYSOL 41 (PDB 6lyz) (Fig. 2). The profiles calculated by our program and by CRYSOL agree with the experimental profile within its error, though the CRYSOL profile is a slightly better fit (χ2=0.20 versus χ2=0.26). We also determined that using a single consensus atomic shape has no notable effect on the accuracy of calculated SAXS profiles; that is, the χ2 difference between profiles calculated with the consensus and specific atomic shapes is approximately 0.001 (Supplementary Theory and Methods).
We first illustrate the method by its application to monomeric Diphteria toxin (Fig. 3A, PDB code 1mdt). We calculated the SAXS profile of the crystallographic structure: The q values ranged from qmin=0.02 Å-1 to qmax=0.5 Å-1 and the I(q) values were sampled on 100 equidistant grid points. Experimental error was simulated by adding white Gaussian noise with a standard deviation of 0.3 times I(qmax). The two Diphteria toxin domains were treated as rigid bodies; they were taken from the dimeric form of the protein, which possesses a dramatically different domain arrangement (Fig. 3B, PDB code 1ddtA).
In our approach, the predicted quaternary structure depends primarily on the experimental SSAXS term and the modeling terms Soverlap and SDOPE. The main role of the additional term Sstereo is to keep domains in relative proximity and thus restrict the number of possible conformations. To assess the effect of combining experimental information (SSAXS) and modeling information (Soverlap and SDOPE) on the predicted quaternary structures, we calculated models with different combined scoring functions consisting of (i) only Sstereo, Soverlap, and SDOPE; (ii) only Sstereo and SSAXS; and (iii) the complete S. One thousand independent optimizations were carried out for each scoring function variant in the global sampling mode. For each scoring function, we evaluated the score distribution and the Cα RMSD values of the models with respect to the native state (Fig. 3C-E). The first scoring function, which lacks SSAXS, has minima for configurations that differ substantially from the native state (Fig. 3C); none of the models has a Cα RMSD below 10 Å. For the scoring function including only SSAXS and Sstereo, a set of different configurations is consistent with the SAXS profile (χ2< 1) (Fig. 3D); the lowest scoring configurations are not near-native (their Cα RMSD is between 10 and 15 Å). In contrast, optimization of the complete S successfully results in near-native models (Fig. 3E). Thus, the combination of the SAXS and modeling terms results in a global minimum close to the native state, while either of the individual terms do not. This synergy justifies the integration of SAXS and protein structure modeling.
We further analyzed the models obtained from the optimization of S. Using Cα RMSD, we hierarchically clustered the 10% models with the best χ2 (Fig. 4A). The models clearly cluster into 3 distinct groups, separated by more than 12.5 Å Cα RMSD from each other. Each cluster is represented by the model with the best S (Fig. 4B-D). Albeit all models correspond to distinct configurations, their shapes and calculated profiles are similar, as expected from the nature of the SAXS restraint. The model from cluster III (Fig. 4D) has the lowest S and the lowest SAXS penalty (χ2=1.3 compared to 1.5, and 1.4 for the models from clusters I and II, respectively; Fig. 4E) and is closest to the native state in terms of its Cα RMSD (1.4 Å compared to 10.4 and 14.0 Å, respectively). Interestingly, cluster III does not constitute the largest cluster nor does it include the models with the lowest SDOPE (Supplementary table S1). Thus, in this case S and χ2 are indicative of the cluster with near-native models whereas SDOPE and the cluster size are not.
To assess the modeling protocol with statistical rigor, we applied it to our benchmark set of 9 multidomain proteins (Theory and Methods). The benchmark comprised proteins with 2 domains (5 cases), 3 domains (3 cases), and 4 domains (1 case) (Tables 1, S1). For five proteins, the domains were represented by experimentally determined structures, which are identical in sequence but part of a different assembly. The rigid bodies of the remaining four proteins were modeled by comparative modeling based on related template structures. Thus, the benchmark covered different scenarios in terms of protein size and available structural information.
After optimization and clustering, the cluster with the most accurate model (in terms of Cα RMSD) was termed the most accurate cluster. In the global search mode, the best-scoring models in the most accurate cluster were close to the native state (Cα RMSD < 4 Å) for 6 proteins, of medium accuracy for one protein (Cα RMSD = 12.8 Å), and of low accuracy in only two cases (Cα RMSD > 18 Å) (Tables 1, S1). When crystallographic domain structures were used, the resulting configurations were always highly accurate. In contrast, when comparative models were used, high accuracy configurations could only be obtained for 1cb6 and medium accuracy configurations for 1iknA. In our benchmark, these proteins also possessed the highest sequence identity to their template structures (63% and 46%, respectively). It is well established that the accuracy of a comparative model is correlated with the sequence identity on which it is based 42. Therefore, not surprisingly, the accuracy of whole protein models produced by our method is correlated with the sequence identity of the structural template for the modeling of the individual domains.
In real applications, we do not have the modeled native structure. Thus, we don't know which one of the generated models is most accurate, but need to select it based on some criterion. Possible criteria include the S and SSAXS (χ2) scores, as well as the cluster size. In all but one high- and medium-accuracy case (global sampling for 1mdtA; Figs. 3 and and4),4), the most accurate model was found in the largest cluster. Similarly, in all but one case (global sampling for 1cb6), the most accurate model was also the one with the best S and SSAXS scores. Thus, the selection of the final prediction by either one of these three criteria is likely to identify the most accurate of the generated models.
In almost all cases, the local search produced more accurate models than the global search. Only for 1mdtA, the local search resulted in a less accurate model than the global search, consistent with the largest distortion in the initial structure relative to the native state (Cα RMSD > 20 Å) among all benchmark cases. Remarkably, for the four-domain protein 1cb6 (Cα RMSD of the initial model is 8.2 Å), local sampling yielded more accurate results than global sampling, albeit global sampling results in models with lower scores. Thus, in this case, the near-native configurations are within the radius of convergence for the local sampling, whereas the global minimum, which is not a near-native configuration, is not. Based on the examples of 1mdtA and 1cb6, we conclude that the radius of convergence for local sampling is on the order of 10 Å.
We also applied our protocol to 3 binary protein complexes (Theory and Methods). The accuracy of binary complex models is generally lower than that for the individual two-domain proteins (Table 2). The highest accuracy was achieved for 1avxAB, whose best-scoring models had Cα RMSD of approximately 10 Å. This target was classified as ‘easy’ in Docking Benchmark 2.0 36. The high Cα RMSD is due largely to inaccurate relative orientation of the two proteins (Δα = 81.6°); in comparison, the relative position is quite accurate (Δr = 5.1 Å). Interestingly, local sampling in the vicinity of the RBNC state resulted in a model similar to the configuration using global sampling, which also scored better than the RBNC state according to S and χ2, but not SDOPE (Table S2).
The results for target 1ibr are similar to those for 1avx; the best scoring models were different from the native state and scored better than the RBNC state. The target 1ibr is classified as ‘difficult’ in Docking Benchmark 2.0 36.
For the third example, 1fq1 (‘difficult’ in Docking Benchmark 2.0 36,), the largest three clusters produced by the global sampling were of similar size and different from the native state (Cα RMSD > 30 Å). Moreover, the individual S, SSAXS, and SDOPE scores for the RBNC configuration were worse than those for optimized models.
If we assemble the native protein structures instead of their models, the local sampling around the native state always finds a configuration close to the native state (Cα RMSD < 1.2 Å); moreover, the optimized S, SSAXS, and SDOPE are significantly better than those for the assembly of non-native rigid bodies (Table S2).
To test our methodology on experimental rather than simulated profiles, we modeled the quaternary structure of D-xylose isomerase (XI) from Streptomyces rubiginosus. The model was calculated using three different approximations for the monomers: (i) the native subunit structure in the complex, (ii) a comparative model based on 4xim as a template (67% sequence identity), and (iii) a comparative model derived from 1a0d (28% sequence identity). The best possible superpositions of four copies of the monomer comparative models onto the native structure of the XI complex, ie, the Ca RMSD values of the RBNCs, were 2.7 Å and 5.1 Å for 4xim and 1a0d, respectively.
To reproduce the 222 symmetry of the XI tetramer consisting of identical subunits A, B, C, and D, we added a symmetry term Ssym to S. Specifically, we restrained equivalent intra-subunit Cα atom distances between equivalent chain pairs to be similar to each other 26; 43.
We acquired an experimental XI SAXS profile ranging from qmin=0.01 Å-1 to qmax=0.27 Å-1 (Theory and Methods). For each of the three assembly calculations, we built 500 configurations using the global sampling mode and clustered the 10% best-scoring (χ2) configurations (Table 3). Because XI is a homotetramer, the labels on the four subunits are irrelevant when comparing two configurations; ie, swapping monomers has absolutely no impact on the structure. Therefore, we calculated Cα RMSDs between a model and the native state for all subunit permutations in the model and chose the minimum value as the final Cα RMSD.
For the native monomer, the largest cluster contains the models with the lowest score S as well as the most native-like models (Cα RMSD = 3.5Å), indicating an accurate prediction. However, if we only consider SSAXS, the best scoring models are not closest to the native state (Fig. 5A-D), illustrating the positive role played by the other terms in the combined scoring function S (Fig. 5E, F). The agreement between the model SAXS profiles and the experimental profile is generally lower than that in the benchmark simulations (χ2min ≈ 8 versus χ2min ≈ 1). We also checked our χ2 against that implemented in CRYSOL 41, confirming that the CRYSOL χ2 likewise favors non-native solutions and correlates well with our χ2 (Supplementary Fig. S1). Comparison of the SAXS profiles of the best scoring models in the largest cluster with the experimental and simulated SAXS profiles of the native state reveals that deviations from the experimental profile are most significant in the range of 0.11 < q < 0.12 Å-1 (Fig. 5A-C). Similar observations were reported previously for XI 39. In contrast to SSAXS, SDOPE is substantially lower for models in the cluster close to the native state than for the models in the other clusters.
The calculation with the 4xim-based monomer models follows the same trends as those with the native subunit (Table 3 and Fig. 5E). Therefore, in a realistic setting, our modeling protocol would have determined correctly the quaternary arrangement of four subunits of the XI monomer, using its experimental SAXS profile and a high accuracy subunit model based on 67% sequence identity to the template structure.
In contrast, the results for the comparative model based on the remotely related 1a0d are qualitatively different. In this case, the largest cluster does not contain the most nativelike models (Cα RMSD = 49.1 Å). If we select the cluster according to S or SDOPE, the model is closer to the native state, albeit far from it in terms of Cα RMSD (15.3 Å). Nevertheless, this model still predicts correctly many of the residues at the subunit interfaces: 14% of the native contacts (the number of correctly predicted residue–residue contacts in the model divided by the number of contacts in the native structure) are identified correctly, which is considered ‘acceptable’ in blind assessments of protein docking methods at CAPRI meetings 44.
We now analyze the extent to which model accuracy is limited by sampling and scoring; sampling is limiting when configurations close to the native state are not generated during optimization and scoring is limiting when the most native-like configurations do not correspond to the global minimum of S.
For three of our benchmark cases (1ha0, 1ibr, and 1ko9), we plotted the best score (Smin), the corresponding Cα RMSD (RMSD(Smin)), and the minimum Cα RMSD (RMSDmin) for a set of structures resulting from global sampling against the number of independent optimizations (Fig. 6). For the two-domain protein 1ha0, Smin, RMSD(Smin), and RMSDmin do not improve substantially beyond 100 independent optimizations. Moreover, global sampling performs as well as local sampling starting from the RBNC (Fig. 6A; Supplementary Table S1). For the protein complex 1ibr, Smin and RMSD(Smin) reach a plateau at approximately 100 optimizations. RMSD(Smin) asymptotically reaches the value for local sampling around the RBNC, which is well above 10 Å, showing that a non-native configuration scores better than the RBNC (Fig. 6B; Table S2). The values for RMSDmin asymptotically approach a value of approximately 3 Å (Cα RMSD of RBNC is 2.3 Å), indicating that near-native configurations are sampled, but do not score well. In summary, we observed for all two-component systems (two-domain proteins and binary complexes) that the best-scoring models scored approximately the same as the refined RNBC configurations; therefore, sampling does not appear to be limiting the accuracy of our predictions in these cases.
For the three-domain protein 1ko9, Smin, RMSD(Smin) and RMSDmin improve slowly with an increase in the number of independent optimizations, beginning to reach a plateau at 400 independent optimizations. Smin from global optimization exceeds Smin of the models obtained by refining the RBNC, indicating that in this case the accuracy of the predicted quaternary structures is limited by sampling. Similar results are obtained for the other three- and four domain proteins; the global minimum of S could only be reached using global sampling if we performed at least 1,000 independent optimizations. However, when a sufficiently accurate starting configuration is available (eg, from a similar template protein), highly accurate configuration models can be obtained using only 100 independent optimizations (Cα RMSD < 2 Å) (Supplementary Table S1).
We incorporated information from a SAXS profile into protein structure modeling by satisfaction of spatial restraints, implemented in our Integrative Modeling Platform (IMP) 3; 26. We then benchmarked IMP by modeling quaternary structures of multidomain proteins and protein assemblies using rigid domains and proteins, respectively. We discuss here (i) the relationship between our method and those of others, (ii) the benefits of integrating protein structure modeling and SAXS fitting, (iii) the limitations arising from inaccurate scoring, imperfect sampling, and errors in rigid bodies, as well as (iv) the scope for integration of additional information into the modeling process.
Recently, experimental SAXS profiles have also been used to calculate atomic models of proteins by BUNCH 22 and CNS, which relies on NMR-derived restraints as well as a SAXS profile 24. We now outline similarities and differences between these two approaches and that of ours.
Our SAXS penalty is similar to the score implemented previously in CNS 24. Identically, both scores employ the Debye formula to treat the excluded solvent and rely on χ2 as the SAXS penalty. Moreover, they both calculate the partial derivatives allowing them to use gradient-based optimization techniques. However, the computation of the SAXS penalty and its derivative in our approach is significantly faster compared to the original implementation because we employ the electron pair distribution function P(r) for the calculation of χ2 and its derivative (Supplementary Theory and Methods). The gain in efficiency depends on the granularity of I(q) sampling. For example, we reduce the computation time by two orders of magnitude for a dataset with 100 data points and more for finer-sampled profiles. This gain in computational speed does not reduce the precision of the calculated SAXS profiles by more than the typical precision of experimental SAXS profiles. Moreover, we can gain additional efficiency relative to CNS through the use of rigid bodies, which required changes in the calculation of the partial derivatives of SSAXS (Supplementary Material). The gains in computational efficiency are important because they allowed us to sample the space of possible solutions more exhaustively.
Our SAXS penalty is different from the penalty in BUNCH 22, which is calculated by CRYSOL 41 and up-weighs high frequency components in χ2. While we also tested such a scoring function, it did not result in a significant improvement for our benchmark. For example, we modeled the Diphteria toxin using a χ2 that weights frequencies according to q2. The corresponding SSAXS term did not result in a more accurate model if used only in conjunction with Sstereo (Supplementary Fig. S1). It also did not produce a higher yield of near-native configurations when using the full score S, compared to the uniform weighting (Supplementary Fig. S1). Also in our experimental test case XI, the IMP and BUNCH scores are highly correlated (Supplementary Fig. S2). Therefore, for parsimony, we used the scoring function with uniform weighing of the SAXS profile at different frequencies. Moreover, unlike our score, the BUNCH SAXS penalty includes two additional fitting parameters that modify Im 41, corresponding to the average displaced volume per atomic group and the density of the hydration layer. The slightly better fit between the CRYSOL and experimental profiles of lysozyme (Fig. 2) is likely due to these additional parameters, but their effect is negligibly small (Δχ2=0.05). Thus, the additional fitting parameters in CRYSOL cannot explain the errors in the calculated profile of XI for 0.11Å-1 < q < 0.12Å-1 (Fig. 5A). Likewise, the program solX45 calculates SAXS profiles of known atomic structures that fit experimental data as well as CRYSOL's profiles6 without employing fitting parameters other than the protein concentration similarly to our approach. It is unlikely that errors in our experimental data collection are responsible for this discrepancy because independent SAXS measurements of XI yielded comparable differences between experimental profiles and profiles computed from the crystallographic XI structure 39. Moreover, qualitatively similar discrepancies could be observed for E. coli aspartate transcarbamylase (ATCase), a 303 kDa homo-dodecamer, in the original CRYSOL publication 41. The calculated ATCase profile has a pronounced local minimum at q ≈ 0.08 Å-1, while the corresponding minimum in the experimental SAXS profile is significantly less prominent. In addition to the imperfect calculation of the SAXS profile for a given structure, these discrepancies might also reflect structural differences between the crystallographic and solution structures.
Our optimization protocol consists of independent minimizations of the scoring function from many random starting configurations, with the aim to sample the entire configuration space (for the global optimization mode) (Fig. 1). For each minimization, we use a simulated annealing biased-Monte Carlo algorithm; each Monte Carlo step is followed by a local quasi-Newton relaxation, for which the first derivatives of the scoring function are needed. Thus, the biased-Monte Carlo process samples only local minima. In contrast, BUNCH 22 employs a conventional simulated annealing Monte Carlo protocol, in which the sampling is not restricted to local minima. However, in many applications, such as X-ray crystallography 46, NMR spectroscopy 47, comparative protein structure modeling 28, and ab initio structure prediction of proteins 48 and assemblies 49, optimization methods that employ the first derivatives are known to be significantly more efficient than Monte Carlo schemes. Thus, we efficiently implemented and used the first derivatives in our optimization.
A perennial problem in structure modeling is whether or not an optimization scheme finds all the good scoring solutions. To at least partly address this problem, we run many minimizations in parallel and independently on a large computer cluster (ie, hundreds of nodes). The resulting large sample of solutions is then clustered, to present them more parsimoniously for subsequent analysis. “By construction, the structures in one cluster are generally distinct from the structures in another cluster; they involve dissimilar interfaces and have Cα-RMSD values worse than 12 Å (Figs. 4, ,5).5). An analysis of the sampling shows that our protocol usually finds the global minimum for up to four rigid bodies (below); thus, most good-scoring local minima are also expected to be sampled in these cases. A large computer cluster with at least 100 processors is currently needed for an efficient use of our method, perhaps limiting its practical utility. Nevertheless, such computing clusters are becoming increasingly available to many users. In addition, our software is being adapted to run on graphics processing units, such as Nvidia's Tesla with 240 processors (http://www.nvidia.com/object/tesla_c1060.html), which might enable efficient application on a single desktop computer.
Integrative computational methods can exploit various kinds of spatial information to determine the assembly structures at higher accuracy and precision than is possible based on each individual type of data 3; 26; in conjunction, pieces of data that are relatively uninformative by themselves can still result in accurate and precise models of proteins and assemblies.
Here, we combined a SAXS profile with information about protein structures that can be calculated only from their sequence. Specifically, we supplemented the SAXS term (SSAXS) by the penalties for steric clashes (Soverlap), an atomic distance-dependent statistical potential (SDOPE), stereochemistry restraints from a molecular mechanics force field (Sstereo), and symmetry restraints (Ssym). Importantly, an integration of these different types of information can significantly improve modeling accuracy relative to relying on a SAXS profile or protein structure modeling alone (eg, Fig. 3; 1cb6 in Tab. 1; XI in Fig. 5). Moreover, the sum of the different terms was not observed to be less accurate than any of the individual terms; in other words, when the final configurations were inaccurate, no optimization of the individual terms, including SSAXS, produced an accurate configuration.
Our method does not necessarily predict unique best-scoring solutions; due to the low information content of the input restraints, models from different clusters can have comparable scores. For example, SSAXS is completely invariant to rotations of a spherical rigid body. For the benchmark case 1cb6, models from the near-native cluster and a non-native cluster have similar scores (Supplementary Table S1). Nevertheless, due to the combination of different types of information, the number of distinct configurations compatible with the input data is generally much smaller compared to using only a single type of information (eg, Fig. 3).
Our benchmark allows us to assess the limitations of the protocol and highlight opportunities for future research. Modeling by optimization depends on two conditions: (i) The scoring function needs to have a global minimum at the native or near-native state and (ii) the sampling needs to be sufficiently thorough to find the global or near-global minimum. Therefore, we tested the degree to which our method is limited by the accuracy of the scoring function and the thoroughness of sampling. We also asked how accurate do the rigid bodies need to be so that the scoring function still has the global minimum at the native state. The assessment of sampling allowed us to judge when a global search without reliance on a suitable initial structure can be successful; or, conversely, when we need a sufficiently accurate initial model so that at least a near-native state can be found by local sampling alone. We also analyzed the accuracy of the method as a function of the number of rigid bodies, which allows us to further qualify sampling and scoring limitations.
For systems of two rigid bodies (ie, two-domain proteins and binary protein complexes), even the global sampling produced numerous configurations close to the global minimum of the scoring function S (Fig. 6A, B; Supplementary Tables S1 and S2). Thus, the accuracy of these models is largely determined by the accuracy of S. The global minimum of S corresponded to the native or near-native state only if the rigid bodies were not too distorted (ie, Cα RMSD of less than 3 Å). Therefore, as expected, the accuracy of the rigid bodies crucially influences the landscape of S.
Among the individual terms of S, it is not surprising that SDOPE sometimes favors non-native configurations if sufficiently distorted rigid bodies are used. Errors in the positions of exposed atoms interfere with their packing, which is the aspect scored by SDOPE. Moreover, distorted rigid bodies can cause steric clashes and thus increase Soverlap. In contrast to SDOPE, a strong dependence of SSAXS on the rigid body accuracy is somewhat surprising; in all cases, the rigid domains and proteins possessed the correct fold, but SSAXS did not always favor near-native configurations. If the Cα RMSD errors of the rigid bodies were above approximately 3 Å, configurations with minimum SSAXS were significantly different from the native state (ie, higher than 18 Å for Cα RMSD). This result can be explained by the ambiguity of the SAXS restraint; many different quaternary structures have a similar envelope and therefore similar SAXS profiles. Subtle structural differences can then make SSAXS of near-native configurations worse than that for some non-native configurations. The deteriorating effect of errors in the rigid bodies on the accuracy of the predicted configurations turns out to be most pronounced for complexes (Table 2, Supplementary Table S2). For multidomain proteins, the connecting linker limits the number of possible configurations. Thus, the number of possible false minima of S is generally larger for protein complexes than for monomeric proteins. We anticipate that the precise degree of tolerable structural differences will also depend on the shape of the rigid body: If a rigid body does not have a distinctive shape (eg, it is a sphere or a cylinder), small distortions can be sufficient to alter the position of the global minimum; for more distinct shapes, the sensitivity of the prediction accuracy to the rigid body errors is expected to be smaller.
For three or more rigid bodies, the typical number of independent optimizations we used (1000) was insufficient to reliably find the global minimum of S in a global search (Fig. 6C). For four domains (benchmark case 1cb6), we increased the number of initial configurations to 5,000, requiring 3 days on 200 CPUs. Such an exhaustive sampling is impossible without employing a large computer cluster. However, configurations close to the global minimum of S could be found at dramatically reduced computational cost employing local sampling, if the template (initial) configuration was sufficiently close to the native state (Cα RMSD < 10 Å). In such a case, the local search is more efficient than the global search because no computing time is wasted on searching far from the native configuration. In the future, the development of computationally more expensive and sophisticated sampling strategies may allow sampling the configurations of five or more domains with a larger radius of convergence than that of our present optimization.
To overcome the limitations on prediction accuracy arising from rigid body errors, we probably need to abandon the rigid body approximation. In the field of protein-protein docking, simultaneous sampling of different component conformations and their configuration has been described49. Similar approaches could be used for fitting a configuration to a SAXS profile, requiring both more sophisticated scoring functions and sampling algorithms than those described here. The use of high-angle scattering data might allow us to compute atomic structures more accurately, but this goal would require flexible modeling of the atomic structures. Structural changes at the domain level result in signals at frequencies beyond q = 0.5 Å-1, which we did not consider in our calculations because we represented domains and proteins as rigid bodies.
Our protocol currently relies on the sample protein or complex existing in a single state. However, proteins and complexes can exist in equilibrium among different states, corresponding to varied packing between secondary structure segments, domains, and proteins as well as variations in unstructured regions such as long domain linkers. An approach to score a given ensemble of models has been proposed recently 51. Our protocol could also be extended to optimize an ensemble of models using a similar score. However, the increased computational effort as well as the limited information in SAXS profiles will limit such approaches.
Given the relatively low information content of a SAXS profile and the limitations in our protein structure modeling terms, incorporating additional information into the scoring function S is desirable. Using supplementary information is further justified by the sensitivity of S to rigid body errors and possible systematic experimental errors in a SAXS profile (eg, due to aggregation of macromolecules in solution). Further integration is facilitated by our implementation of SAXS fitting in IMP, which can already produce models by simultaneously satisfying a large variety of other spatial restraints. In addition to the SAXS and modeling terms used here, IMP can incorporate (i) restraints implied by an alignment between the modeled sequence and related structures 28, (ii) restraints implied by an alignment of the modeled sequence and many short segments of known structure, (iii) bioinformatics analysis of protein interaction modes 52, (iv) protein-protein docking that is restrained by the composition of interacting surfaces determined by NMR spectroscopy 53, (v) symmetry and density from a cryo-EM map of the assembly 54, and (vi) proximity of subunits inferred from immuno-affinity purifications, yeast two-hybrid system, footprinting, and chemical cross-linking 26. The global shape information from SAXS is especially complementary to local restraints, such as the atomic distance restraints derived from chemical cross-linking detected by mass spectrometry 55. Another attractive application is the integration of SAXS for structural characterization of component structures of a large assembly whose overall density is determined by cryo-EM.
An accurate quaternary structure model of a protein or a protein assembly can be obtained using only a SAXS profile, stereochemistry restraints from a molecular mechanics force field, and an atomic distance-dependent statistical potential, provided sufficiently accurate approximations for the constituent domain and protein structures are available. Otherwise, the predictions are generally ambiguous and have large errors. Our integration of a SAXS profile into modeling by satisfaction of spatial restraints will facilitate further integration of different kinds of data for structure determination of proteins and their assemblies.
We thank Maya Topf, Narayanan Eswar, Frank Alber, Fred Davis, Min-Yi Shen, and Marc Marti-Renom for fruitful discussions. FF is grateful to a long-term fellowship from the Human Frontier Science Project Organization (HFSPO). KAK was supported by a NSDEG Graduate Fellowship. SSRL is funded by DOE BES, and the SSRL Structural Molecular Biology Program is supported by DOE, OBER and NIH NCRR BTP Grant (P41 RR001209). DAA has been supported by the Howard Hughes Medical Institute, DAA and AS have been supported by a UC Discovery Grant (bio03-10401/Agard). AS has also been supported by The Sandler Family Supporting Foundation, NIH (R01 GM54762, R01 GM083960, U54 RR022220, and PN2 EY016525), NSF (EIA-032645 and IIS-0705196), Hewlett-Packard, NetApps, IBM, and Intel.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.