Structural description of macromolecular assemblies is essential for a mechanistic understanding of the cell [1]. The scope of the problem is revealed by protein interaction studies: the yeast cell contains approximately 800 distinct core complexes, averaging 4.9 proteins each [2], most of which have not yet been structurally characterized [3]. The human proteome is likely to have an order of magnitude more distinct assemblies than the yeast cell. Therefore, there are thousands of biologically relevant assemblies whose structures still need to be determined.
Structural determination of macromolecular assemblies is a major challenge in structural biology. X-ray crystallography can provide structures of stable assemblies at atomic resolution [4]. However, there are many other assemblies that are refractory to crystallographic determination. A low-resolution structure of these assemblies can be determined by cryo-electron microscopy (cryoEM) [5]. The resolution usually ranges from 4 Å, where the backbone of the protein can be traced, to 30 Å, where only the outer envelope of the assembly is visible [6].
The increasing number of atomic and cryoEM datasets [7] has stimulated the development of computational techniques for fitting atomic structures of assembly components into a cryoEM density map of the whole assembly. The result is a pseudo-atomic model of the assembly that can reveal significant insights into its structure, dynamics, function, and evolution [8–12].
Here, we focus on determining the positions and orientations (i.e., placements) of multiple atomic component models within the assembly density. When the structure of a homologous assembly (template) is available, the placements of the components can be computed by fitting the template into the target assembly density, superposing the target component models on the corresponding template components, and refining the model [13, 14]. Alternatively, the component positions can be determined experimentally by a number of protein labeling methods, relying, for example, on gold-labeled antibodies [15]. However, when only a cryoEM map and component structures are available, no general method yet exists for solving the configuration problem.
A sequential method for fitting multiple components into an assembly map has been described [16]. The method starts by fitting the largest component into the map, followed by an iterative fitting of the largest remaining component into the unoccupied density, until all components are fitted. The fitting of a component into a given map can be performed manually using interactive visualization tools [17]. Preferably, automated fitting methods can be used instead; these assess the placement of a component by the fit between the component and a segmented [6] or complete density of the assembly, and optimize the fit over the translational and rotational degrees of freedom of a rigid component relative to the map [18]. The sequential method is applicable if the components to be fitted dominate the unoccupied densities. Unfortunately, this condition is generally not satisfied, especially when the resolution is low, the number of components is large, and component models are inaccurate [19]. For example, sequential fitting is not expected to work for the 19S proteasome with 18 component proteins [20], the mammalian ribosome, for which 30 out of 80 proteins are not present in the known archaeal or bacterial ribosomes [13], or the ryanodine receptor isoform 1 (RyR1), for which some domains are poorly modeled while for others no template is available [21].
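To make the sequential strategy concrete, the following is a minimal toy sketch, not the published method [16]. It assumes a two-dimensional voxel grid, translation-only placements, and a simple overlap score; the names `translate`, `best_shift`, and `sequential_fit` are hypothetical and introduced only for illustration.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Voxel = Tuple[int, int]
Shape = FrozenSet[Voxel]


def translate(shape: Shape, shift: Voxel) -> Shape:
    """Move every voxel of a rigid shape by the same offset."""
    dx, dy = shift
    return frozenset((x + dx, y + dy) for x, y in shape)


def best_shift(shape: Shape, free: Set[Voxel]) -> Optional[Voxel]:
    """Return the translation with the best toy fit score, or None if no density is left.

    Assumed score: voxels inside unoccupied density minus voxels outside it.
    """
    best, best_score = None, None
    for fx, fy in sorted(free):
        for sx, sy in sorted(shape):
            shift = (fx - sx, fy - sy)
            placed = translate(shape, shift)
            score = len(placed & free) - len(placed - free)
            if best_score is None or score > best_score:
                best, best_score = shift, score
    return best


def sequential_fit(density: Set[Voxel], components: Dict[str, Shape]) -> Dict[str, Voxel]:
    """Fit components one at a time, largest first, into the unoccupied density."""
    free = set(density)
    placements: Dict[str, Voxel] = {}
    for name in sorted(components, key=lambda n: len(components[n]), reverse=True):
        shift = best_shift(components[name], free)
        if shift is None:
            break
        placements[name] = shift
        # Density claimed by this component is no longer available to later ones.
        free -= translate(components[name], shift)
    return placements


# Toy assembly: a 3 x 2 block of density built from a 2 x 2 and a 1 x 2 component.
density = {(x, y) for x in range(3) for y in range(2)}
components = {
    "small": frozenset({(0, 0), (0, 1)}),
    "big": frozenset({(0, 0), (0, 1), (1, 0), (1, 1)}),
}
placements = sequential_fit(density, components)
```

In this toy case the large component dominates the map, so the greedy order succeeds. Because each placement is committed before later components are considered, an early misplacement cannot be corrected, which is the weakness described above.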
Here, we describe a method named MultiFit for determining the configuration of multiple high-resolution component structures based on the quality-of-fit of each component into the density map, the protrusion of each component from the map envelope, and the shape complementarity between pairs of components. The combination of these terms reduces the ambiguity of the final solution, compared to using any individual term on its own.
The task of sampling the configuration space is challenging because the placement of a component depends on the placements of other components. MultiFit tackles this combinatorial challenge by reformulating the problem as an inferential optimization over a discrete sampling space. In outline, a discrete set of possible placements for each component is first generated independently of other components. Next, the globally optimal combination of placements with respect to a scoring function is found by a combination of branch-and-bound search and the DOMINO (Discrete Optimization of Multiple INteracting Objects) inferential optimizer. The relative translations and orientations of pairs of components in the best-ranking configurations are then refined; specifically, a refined discrete sampling space is generated by pairwise geometrical docking between interacting components, and the optimal refined combination of placements is again found using DOMINO. We successfully validated the method on a simulated benchmark of 6 assemblies, consisting of up to 7 proteins each. In addition, for a more realistic test, we determined the configuration of 4 domains in the subunit of GroES-ADP7-GroEL-ATP7 chaperonin from Escherichia coli based on an experimentally determined map at a resolution of 23.5 Å [22]. A near-native configuration scored best in 4 test cases, third best in 2 cases, and fourth best in the remaining case.
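The discrete formulation can be illustrated with a small, generic branch-and-bound search. This is a sketch under stated assumptions and does not reproduce DOMINO or the MultiFit scoring function: the score is assumed to be a sum of per-component fit costs and non-negative pairwise penalties, lower is better, and the names `branch_and_bound`, `unary`, `pairwise`, and `positions` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def branch_and_bound(
    unary: Sequence[Sequence[float]],
    pairwise: Callable[[int, int, int, int], float],
) -> Tuple[float, List[int]]:
    """Find the lowest-scoring choice of one candidate placement per component.

    unary[i][p] is the assumed fit cost of placement p of component i.
    pairwise(i, p, j, q) is an assumed non-negative penalty between two placements.
    """
    n = len(unary)
    # Optimistic cost of completing a partial assignment: best unary cost of each
    # remaining component. This is a valid lower bound only because pairwise >= 0.
    min_rest = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_rest[i] = min_rest[i + 1] + min(unary[i])

    best_score = float("inf")
    best_assign: List[int] = []

    def recurse(i: int, partial: float, assign: List[int]) -> None:
        nonlocal best_score, best_assign
        if partial + min_rest[i] >= best_score:
            return  # prune: this branch cannot beat the best solution so far
        if i == n:
            best_score, best_assign = partial, list(assign)
            return
        for p, cost in enumerate(unary[i]):
            extra = cost + sum(pairwise(j, assign[j], i, p) for j in range(i))
            assign.append(p)
            recurse(i + 1, partial + extra, assign)
            assign.pop()

    recurse(0, 0.0, [])
    return best_score, best_assign


# Toy problem: three components, two candidate positions each (hypothetical values).
positions = [[0, 1], [0, 2], [1, 2]]
unary = [[0.0, 1.0], [0.0, 2.0], [0.5, 0.1]]


def clash(i: int, p: int, j: int, q: int) -> float:
    """Penalize two components that claim the same position."""
    return 10.0 if positions[i][p] == positions[j][q] else 0.0


score, assignment = branch_and_bound(unary, clash)
independent = [min(range(len(u)), key=u.__getitem__) for u in unary]
```

Here `independent` picks the best placement of each component in isolation and puts the first two components at the same position, whereas `assignment` gives up some individual fit quality to avoid the clash. This shows, on a small scale, why the placements must be optimized jointly.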
Below, we begin with a detailed description of general combinatorial optimization by DOMINO, followed by a formal definition of the component configuration problem and the MultiFit algorithm to solve it using DOMINO (Theory). We then demonstrate the performance of MultiFit on the benchmark cases (Results). Finally, we discuss the implications of MultiFit and DOMINO for structural characterization of large assemblies (Discussion).