|Home | About | Journals | Submit | Contact Us | Français|
Structural models of macromolecular assemblies are instrumental for gaining a mechanistic understanding of cellular processes. Determining these structures is a major challenge for experimental techniques, such as X-ray crystallography, NMR spectroscopy and electron microscopy. Thus, computational modeling techniques, including molecular docking, are required. The development of most molecular docking methods has so far been focused on modeling of binary complexes. We have recently introduced the MultiFit method for modeling the structure of a multi-subunit complex by simultaneously optimizing the fit of the model into an electron microscopy density map of the entire complex and the shape complementarity between interacting subunits. Here, we report algorithmic advances of the MultiFit method that result in an efficient and accurate assembly of the input subunits into their density map. The successful predictions and the increasing number of complexes being characterized by electron microscopy suggests that the CAPRI challenge could be extended to include docking-based modeling of macromolecular assemblies guided by electron microscopy.
Protein complexes are vital molecular machines of the cell1; structural characterization of these complexes provides insight into their function2. Given the number of undetermined protein complexes3 and inherent difficulties in experimentally determining the structures of these complexes at atomic resolution4, there is an acute need to develop computational methods for structural modeling of macromolecular assemblies.
Molecular docking techniques have traditionally been used to predict binary complexes given their unbound component structures. These methods rely on a global search of a large set of possible binary configurations, maximizing geometrical and physicochemical complementarities between a pair of constituent subunits5–9. The CAPRI challenge provides a critical assessment of such docking methods10. Analysis of CAPRI results shows that in some cases docking methods suffer from relatively low accuracy, especially when the individual protein subunits are modeled or when their bound and unbound conformations significantly differ9,11–13. This limitation has led to the emergence of restrained docking procedures that guide sampling and/or filter docking solutions based on additional sources of information9,14,15. Notably, the HADDOCK webserver can incorporate multiple sources of information into its docking procedure16.
While most docking methods are designed to deal with two molecules, the majority of functional macromolecular assemblies in the cell consist of more than two components. Inspired by targets in previous CAPRI rounds12, several groups have proposed docking-based modeling methods for symmetric complexes17–21. Fewer docking methods have been developed for the significantly more challenging case of asymmetric assembly modeling14,22. A major obstacle for macromolecular docking algorithms is the ability to select near-native models from an ensemble of possible solutions. Knowledge of the overall shape of a complex, even at low resolution, can significantly reduce ambiguity inherent in such an ensemble when it is used to filter the set of candidate models. Such overall shape information can be obtained by Electron microscopy (EM) 23–25 or small angle X-ray scattering 26,27 techniques.
EM is becoming a method of choice for structural visualization of large protein complexes. EM reconstruction techniques provide a density map of a complex at resolutions typically ranging from 5 Å to 25 Å28. Following generation of a density map, atomic structures of complex components are often fitted into the map to construct a “quasiatomic” model of the complex 29–31. Thus, EM data can be used not only to filter docking solutions but also to fit assembly subunits into their density. Rigid fitting techniques rely on a global search for the placement (position and orientation) of a single subunit inside the density map that maximizes the overlap between the model and the map32. However, the majority of these techniques are designed to work independently on single subunits, without taking into account protein-protein interaction interfaces.
To combine the strengths of molecular docking and molecular fitting approaches, and to overcome their limitations, we have developed the MultiFit method. MultiFit simultaneously positions protein subunits into a density map of a protein assembly by combining geometric principles commonly used in molecular fitting and molecular docking33. Here, we describe new algorithms for two of the stages of the MultiFit algorithm that significantly improve the accuracy of the method. In addition, we describe an extension of the MultiFit method for cyclic symmetric assemblies, resulting in a highly efficient algorithm that accurately treats such cases.
Below, we outline the MultiFit algorithm and describe the recent algorithmic advances. We then illustrate the method by modeling the structure of the methane monooxygenase enzyme (asymmetric complex) and the GroEL chaperone (cyclic symmetric complex), followed by results on a 10 complex benchmark. Finally, we discuss the advantages of incorporating EM data in macromolecular docking algorithms.
MultiFit is a computational method for simultaneous fitting of atomic protein structures into a protein assembly density map at resolutions as low as 25 Å. The input to the method is a set of atomic structures of subunits and an EM density map of their assembly. The MultiFit algorithm simultaneously fits the subunits into their assembly density map and optimizes the interfaces between neighboring subunits. The method’s output is a ranked list of assembly models. An assembly model of n subunits is defined as a set of n rigid three-dimensional (3D) transformations, each applied on a corresponding assembly subunit.
Assembly models are ranked by a geometric scoring function composed of a linear combination of three terms: (i) the quality-of-fit term scores how well a model fits into the assembly density map, (ii) the protrusion term scores how well each subunit is placed inside the density envelope and (iii) the interaction term scores the pairwise shape complementarity between pairs of interacting subunits and also accounts for their excluded volume. We use a combination of these three terms, as each alone is insufficient for an unambiguous identification of the native configuration. A detailed mathematical description of these terms is provided elsewhere33.
The optimization algorithm is composed of four stages, each sampling assembly models at increasingly higher resolution and accuracy, further restricting the search space to be sampled in the following stage (Figure 1): (i) anchor graph segmentation, (ii) fitting-based assembly configuration, (iii) docking-based pose refinement, and (iv) rigid-body minimization. In “anchor graph segmentation”, an unlabeled segmentation of the density map into n regions is calculated using a Gaussian mixture model clustering procedure; the segmented n regions correspond approximately to the regions allocated by the n subunits in the complex. In “fitting-based assembly configuration”, a set of coarse assembly models is found by an enumeration over possible assignments of subunits to regions, followed by simultaneous local fitting of the subunits in the corresponding regions. In “docking-based pose refinement”, each of the models found in the “configuration” stage is refined by simultaneous local optimization of the interfaces between pairs of interacting subunits. In “rigid body minimization”, each of the models found in the “refinement” stage is further refined using a local Monte Carlo/conjugate gradients minimization procedure34. Detailed description of the original optimization procedure is provided in a previous publication33. The recently developed algorithms for the “anchor graph segmentation” and “fitting-based assembly configuration” stages are described in Supplementary Materials.
Many of the protein complexes determined by EM techniques are symmetric; reconstruction algorithms exploit the symmetry constraint to enhance the resolution of the generated density map28,35–37. Symmetry can also be exploited for the purpose of fitting multiple subunits into a density map of their symmetric assembly. Inspired by the success of symmetric docking algorithms20,21, we have extended MultiFit to exploit cyclic symmetry (Cn) in the optimization algorithm (Cn_MultiFit). The optimization procedure of Cn_MultiFit is composed of the following stages: (i) symmetry axis detection, (ii) anchor graph segmentation, (iii) fitting-based Cn assembly configuration, and (iv) rigid-body minimization (Figure 1). The main differences between the optimization procedures of MultiFit and Cn_MultiFit are in the “symmetry axis detection” and the “fitting-based Cn assembly configuration” stages, as described below.
A principal component analysis38 (PCA) based procedure is applied to determine the symmetry axis of a Cn symmetric complex. Specifically, the procedure first calculates three principal axes for the set of 3D coordinates of density map voxels that have density values within the top 20% of those for voxels in the density map. It can be shown that the assembly symmetry axis is one of its density map’s principal axes39. A statistical consistency score is then applied to identify the symmetry axis among the three principal axes39.
First, a single asymmetric subunit is fitted to a segmented region of the density map. Then, for each of the top ten fitting hypotheses, possible ring models of n copies around the complex symmetry axis are sampled. Specifically, the ring models are constructed by applying n-1 symmetry operations to the fitted asymmetric subunit. The symmetry operation that minimizes the MultiFit scoring function is selected among transformations with rotations of 360/n ± 5° around the symmetry axis and translations of ±3Å.
To illustrate the MultiFit algorithm, we describe in detail an application to the methane monooxygenase (MMO) enzyme. The MMO enzyme plays a critical role in the metabolic pathway of Methanotrophic bacteria. It is composed of 6 subunits arranged as a dimer of hetro-trimers. We demonstrate that the structure of the MMO enzyme can be determined by simultaneously fitting its subunits into the assembly density. A density map was simulated from the MMO hydroxylase crystal structure (PDB entry 1MTY40) using the pdb2vol command of Situs41. The structures of the α, β and γ subunits were modeled using templates with sequence identities ranging from 21 to 99 using the MODELLER software42 (PDB entries 1xmgD43, 2indA44 and 1xveF45). The Cα-RMSDs between the models of the α, β and γ subunits and their bound conformations were 2.26 Å, 9.36 Å and 0.82 Å, correspondingly. MultiFit solutions were validated against a reference structure constructed by superposing the α, β and γ subunits models on the assembly crystal structure.
In the “anchor graph segmentation” stage, the assembly density map was segmented into six regions that correspond approximately to locations of the six subunits in their assembly. The segmentation procedure separated the density into such regions well, even though the shapes of the subunits were not part of the input of the procedure (Figure 1). The segmented density map was represented as an anchor graph that provides an unlabeled representation of the assembly topology. The nodes of the anchor graph correspond to the centroids of the segmented regions while edges are defined between pairs of neighboring regions. In the “fitting-based assembly configuration” stage, coarse assembly models were determined for possible assignments of subunits to anchor graph nodes. In detail, for each possible labeling of subunits to the anchor graph nodes (i.e., positioning of the assembly subunits centroids at the centroids of the segmented regions), a discrete sampling space was generated by locally fitting each subunit into its assigned region. The DOMINO optimizer33 was applied to search for the best scoring combination of fitting solutions. The Cα RMSD of the top 20 scored models to the reference complex ranges from 5.7 Å to 15.2 Å to the reference complex. Each such solution provides relatively accurate positioning of the subunits in the assembly; however, the interfaces were inaccurate as the sampling was performed independently on each subunit.
These solutions were then refined in the “docking-based pose refinement” stage. Each model found in the previous stage suggests a pairwise interaction map of the complex and approximate interfaces. A new discrete sampling space was generated by running the restrained pairwise docking program PatchDock46 on predicted interacting subunits. The PatchDock procedure was preformed with updated parameters that allowed for nonnegligible steric clashes. The DOMINO optimizer was again applied to search for the best scoring combinations of the resulting docking solutions. The result of this optimization stage was a set of 20 models, with Cα RMSD ranging from 3.9 Å to 10.5 Å to the reference complex. Finally, a rigid-body minimization procedure was applied to each of these models resulting in a best scoring model with 3.2 Å Cα RMSD to the reference complex.
For comparison, we modeled the MMO enzyme complex given its bound subunits as input. The Cα RMSD of the final model to the native complex was 1.8 Å. In the bound case, the performance of the pairwise docking algorithm could have been sufficient to model the assembly by combinatorial docking22. However, in the unbound case, due to the differences between the modeled subunits and their corresponding bound conformations, pairwise docking alone was not sufficient to supply useful intermediate results. In this example, the EM density map-based fitting procedure was crucial for detecting a near-native model.
To illustrate the Cn_MultiFit algorithm, we describe in detail modeling of the GroEL chaperone complex. GroEL is a bacterial chaperonin that assists in the proper folding of proteins. It is composed of two back-to-back 7-mer rings. The structure of the GroEL complex has been studied extensively by EM (the EM data base47 contains 30 density maps of the GroEL complex). The input to Cn_MultiFit was a 7-mer ring of the GroEL protein extracted from an experimental density map at 23.5 Å resolution (EMD entry 104648) and a monomeric GroEL structure, which was obtained from the corresponding atomic structure (PDB entry 1GRU48). In the “symmetry axis detection” stage, the symmetry axis was correctly identified (Figure 1). In the “anchor graph segmentation” stage, the density map was segmented into 7 regions. The centroids of the segmented regions accurately correlate to the centroids of the subunits in complex, further validating our symmetry axis prediction. In the “fitting-based Cn assembly configuration” stage, we fitted a single copy of the GroEL structure to the density and used the symmetry to build possible models. The result of this stage was a set of 20 models, with Cα RMSD ranging from 4.2 Å to 9.5 Å to the native complex. Finally, a rigid body minimization procedure was applied and a model with Cα RMSD of 3.4 Å was the top-ranked result.
We tested the MultiFit and Cn_MultiFit algorithms on a benchmark of additional 10 complexes, six of which are asymmetric and four of which are Cn symmetric (Table I, Figure 2). The complexes were obtained from the Protein Data Bank (PDB49) and were composed of two to seven subunits. The input to each test case was an assembly density map at 20 Å resolution and structures of the assembly subunits, in the bound state, at atomic resolution. The assembly density maps were simulated using the pdb2vol command in SITUS41. The accuracy of the final set of models was assessed by the assembly placement score33, and Cα-RMSD between each model and its corresponding native structure. In all 10 cases, a model with Cα-RMSD lower than 5.05 Å was found among the top 10 ranked models.
Our results demonstrate the relative robustness of MultiFit to inaccuracies in fitting and/or docking techniques. Benchmarking revealed that fitting into an assembly of subunits with different shapes (“mixed complexes”) was less reliable than fitting a subunit into an assembly of subunits with similar shapes (“uniform complexes”), such as Cn symmetric complexes. Reasons for the relatively ambiguous intermediate results of the “fitting-based assembly configuration” stage for the “mixed complexes” versus the “uniform complexes” include: (i) the nature of the cross-correlation measure, which is biased towards high-density regions of the map31, (ii) the reduction in the number of degrees of freedom derived from the imposed Cn symmetry for some of the “uniform complexes” and (iii) errors in the segmentation used in the “anchor graph segmentation” stage, resulting in segmented regions that do not completely correspond to subunits of mixed sizes and shapes. Despite ambiguous intermediate solutions obtained in fitting some of the subunits into their assembly densities in difficult cases, a near-native model (2.21 Å to 5.05 Å Cα RMSD) was found among the top 10 models. For example, for the 1TYQ assembly of 7 subunits (Table I, Figure 2), the positions of 4 out of the 7 subunits of the complex (chains D-F) were difficult to detect by fitting techniques. However, docking between pairs of interacting subunits, as detected in the “docking based pose refinement” stage, improved the placements of these subunits; most notable is the improvement from 14.6 Å to 2.3 Å Cα-RMSD for subunit G.
In addition, ranking of docking-based models by pairwise docking methods may be inaccurate10. The strength of using EM data as an additional source of information is again demonstrated by the 1TYQ example. The correct docking pose between subunits B and F was ranked only number 943 by the PatchDock procedure, but was the top ranked result by the MultiFit combined geometric score.
With the growing number of macromolecular assemblies characterized by EM47, data-guided modeling techniques are becoming increasingly useful for a mechanistic understanding of these assemblies. We have recently addressed the problem of modeling architectures of macromolecular complexes by simultaneously optimizing the fit of the individual subunits into their assembly density maps and optimizing the interfaces between interacting subunits33. Here, we report algorithmic advances in the density map segmentation and subunit fitting algorithms of the MultiFit method as well as a new algorithm for the modeling of cyclic symmetric complexes. We show that even low-resolution density maps are helpful for modeling assembly architectures and can resolve ambiguous intermediate docking or fitting results. As EM techniques continue to improve, an increasing number of macromolecular complexes will be visualized at sub-nanometer resolution. Integration of intermediate-to-high resolution density map data into computational docking techniques may be extremely useful in resolving ambiguities in docking of unbound subunits and in the refinement of docking solutions. Extending the CAPRI challenge to include docking-based modeling of macromolecular assemblies guided by EM would help to advance these methods and their applicability. The MultiFit software, benchmark, and a tutorial are available as part of the IMP package under the open source lesser-GPL license at http://www.salilab.org/MultiFit/. Remaining challenges include, among others, treating protein flexibility and incorporation of data from additional sources, such as those from proteomics50.
We thank Ben Webb and Daniel Russel for continuous help and support with the IMP framework, Jeremy Phillips for useful discussions about integrative modeling and careful reading of the manuscript, Dina Schneidman for help with the PatchDock program and the reviewers for constructive suggestions. KL has been supported by the Clore Foundation PhD Scholars program, and carried out her research in partial fulfillment of the requirements for the Ph.D. degree at TAU. AS has been supported by the Sandler Family Supporting Foundation, National Institutes of Health (R01 GM54762, U54 RR022220, PN2 EY016525, and R01 GM083960). We are also grateful to Ron Conway, Mike Homer, Hewlett-Packard, NetApp, IBM, and Intel for computer hardware gifts. HJW acknowledges support from the Israel Science Foundation (grant no. 1403/09) and from the Hermann Minkowski Minerva Center for Geometry at TAU.