2.1. Iterative model rebuilding, density modification and refinement
The purpose of the AutoBuild wizard is to provide a highly automated system for model rebuilding and completion. The wizard design allows the user to specify data files and parameters through an interactive GUI or alternatively through keyworded scripts. The AutoBuild wizard begins with data files with structure-factor amplitudes and uncertainties, along with either experimental phase information or a starting model, typically from molecular replacement. It then carries out cycles of model building and structure refinement alternating with model-based density modification and produces a relatively complete atomic model.
The AutoBuild wizard has been designed for ease of use combined with maximal user control, with as many parameters set automatically by the wizard as possible, but maintaining parameters accessible to the user through a GUI and through keyword-based scripts. The wizard uses the input/output routines of the cctbx library (Grosse-Kunstleve et al.), allowing data files of many different formats, so that user data need not be converted to any particular format before using the wizard. Use of the phenix.refine refinement package (Afonine et al.) in the AutoBuild wizard allows a high degree of automation of refinement so that neither the user nor the wizard is required to specify parameters for refinement. The phenix.refine package automatically includes a robust bulk-solvent model and automatically places solvent molecules (Afonine et al.).
The five core modules in the AutoBuild wizard are (i) building a new model into an electron-density map, (ii) rebuilding an existing model, (iii) refinement, (iv) iterative model building beginning from experimental phase information and (v) iterative model building beginning from a model. These five procedures are described in the next sections.
The standard procedures available in the AutoBuild wizard that are based on these modules include (1) model building and completion starting from experimental phases, (2) rebuilding a model from scratch, with or without experimental phase information, and (3) rebuilding a model in place, maintaining connectivity and sequence register. In cases where the starting point is a set of experimental phases and structure-factor amplitudes, procedure (1) is normally carried out and the resulting model is then rebuilt with procedure (2). In cases where the starting point is a model (e.g. from molecular replacement) and experimental structure-factor amplitudes, procedure (3) is normally carried out if the starting model differs by less than about 5% in sequence from the desired model; otherwise, procedure (2) is used.
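The choice among the three standard procedures can be summarized as a small decision function. This is only an illustrative sketch: the function name, the procedure labels and the `starting_point` strings are hypothetical and are not part of the AutoBuild interface; the 5% cutoff is the approximate value quoted above.

```python
def choose_procedures(starting_point, sequence_difference=None, in_place_cutoff=0.05):
    """Return the ordered list of standard procedures for a given starting point.

    starting_point      -- "experimental_phases" or "model" (hypothetical labels)
    sequence_difference -- fraction of residues that differ between the starting
                           model and the desired model (only used for "model")
    in_place_cutoff     -- about 5% in the text; treated here as a sharp threshold
    """
    if starting_point == "experimental_phases":
        # Procedure (1), then the resulting model is rebuilt with procedure (2).
        return ["build_from_experimental_phases", "rebuild_from_scratch"]
    if starting_point == "model":
        if sequence_difference is None:
            raise ValueError("sequence_difference is required when starting from a model")
        if sequence_difference < in_place_cutoff:
            # Procedure (3): keep connectivity and sequence register.
            return ["rebuild_in_place"]
        # Otherwise procedure (2).
        return ["rebuild_from_scratch"]
    raise ValueError("unknown starting point: %r" % (starting_point,))
```

For example, a molecular-replacement model that differs from the target in 2% of its residues would be routed to rebuilding in place, whereas one that differs in 30% would be rebuilt from scratch.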
2.2. Building a model into an electron-density map
The AutoBuild wizard has a multi-step procedure for building an initial model into an electron-density map. In this procedure, several models are built, refined and recombined with each other to create new models. If a model is available from a previous step or is provided by the user, this model can also be recombined with the other models. After each stage of building there is a single ‘best’ model and any number of additional models that have been constructed up to that point.
Initial models are scored based on the number of residues built ($N_{\mathrm{built}}$), the number of residues assigned to the sequence ($N_{\mathrm{placed}}$) and the number of chains in the model ($N_{\mathrm{chains}}$). A large number of chains typically indicates that there are many places where chain connectivity is broken. The score ($Q$) is calculated as $Q = N_{\mathrm{built}} + N_{\mathrm{placed}} - 2 \times N_{\mathrm{chains}}$. Once a model is obtained with an R factor below a pre-set threshold (typically 0.40), the model with the lowest R factor, rather than the one with the highest Q score, is taken as the best model.
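The scoring rule can be written out directly. The sketch below is a minimal illustration rather than the wizard's code: the `ModelSummary` container and the function names are hypothetical, and the R factor is taken to be unavailable (`None`) for unrefined models.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ModelSummary:
    """Hypothetical summary of one candidate model."""
    name: str
    n_built: int                      # residues built
    n_placed: int                     # residues assigned to the sequence
    n_chains: int                     # number of chains (fragments)
    r_factor: Optional[float] = None  # None if the model was not refined


def q_score(model: ModelSummary) -> int:
    """Q = N_built + N_placed - 2 * N_chains."""
    return model.n_built + model.n_placed - 2 * model.n_chains


def pick_best(models: Sequence[ModelSummary], r_threshold: float = 0.40) -> ModelSummary:
    """Use the lowest R factor once any model is below the threshold, else the highest Q."""
    below = [m for m in models if m.r_factor is not None and m.r_factor < r_threshold]
    if below:
        return min(below, key=lambda m: m.r_factor)
    return max(models, key=q_score)
```

With this rule a fragmented model is penalized twice per chain, so a model with 120 residues built, 60 placed and 5 chains (Q = 170) outranks one with 100 built, 80 placed and 10 chains (Q = 160).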
The model-building process begins with (i) building several models into the electron-density map with RESOLVE. The RESOLVE model-building procedure uses a convolution-based search for helix and strand fragments in the map and this search gives results that depend on the precise orientations of the helix and strand templates that are used in the search. Consequently, a relatively diverse set of models can be created by simply varying the parameters of this convolution search. Typically, three models are built in the first step of the AutoBuild model-building procedure. The best model is refined with phenix.refine as described below, including automatic placement of waters and the use of NCS if present, and all models (refined and unrefined) are used in the next step.
The models created in step (i) above are then combined (ii) into a single model using the RESOLVE ‘extend-only’ model-building procedure. In this procedure, a model or models are cut into overlapping segments (typically ten residues long) and are extended as far as possible into the electron density by RESOLVE model building. The resulting set of overlapping segments is then combined into one or more chains by scoring the segments based on length and fit to density and iteratively extending the highest scoring segment by joining another segment to it, crossing over in a place where two or more sequential Cα atoms in the two segments superimpose within a small distance (typically 1 Å).
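Two ingredients of this recombination step, cutting a chain into overlapping segments and locating possible crossover points between two segments, can be sketched as follows. The code is an illustration under stated assumptions: the function names are hypothetical, the step between segment starts (five residues) is an assumption since only the segment length is given above, and chains are represented simply as lists of Cα coordinates.

```python
import math


def overlapping_segments(n_residues, length=10, step=5):
    """Return (start, end) index pairs (end exclusive) of overlapping segments.

    `length` follows the typical ten-residue value; `step` is an assumed overlap.
    """
    if n_residues <= length:
        return [(0, n_residues)]
    starts = list(range(0, n_residues - length, step)) + [n_residues - length]
    return [(s, s + length) for s in starts]


def crossover_points(ca_a, ca_b, max_dist=1.0, min_run=2):
    """Find index pairs (i, j) where `min_run` sequential C-alpha atoms of the two
    segments, starting at i and j, all superimpose within `max_dist` angstroms."""
    points = []
    for i in range(len(ca_a) - min_run + 1):
        for j in range(len(ca_b) - min_run + 1):
            if all(math.dist(ca_a[i + k], ca_b[j + k]) <= max_dist
                   for k in range(min_run)):
                points.append((i, j))
    return points
```

A 25-residue chain is cut into the segments (0, 10), (5, 15), (10, 20) and (15, 25), and any pair returned by `crossover_points` marks a place where one segment could be joined to the other.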
Once a ‘best’ single model has been obtained from step (ii) above, attempts are made to improve this model (1) by rebuilding in the region outside the current model and (2) by using two methods to try to fit loops. The rationale for rebuilding in the region outside the current model is that the thresholds for the fit of a segment of a model being built are set based on the overall r.m.s.d. of the map in the region containing the macromolecule. If there are some parts of the molecule that are more poorly defined, then these parts might never be built as the density is not high enough in that region. By masking off the region of the molecule that has been built already, the thresholds can be more reasonably determined for the remaining region containing the macromolecule. Additionally, by focusing on a small region of the map where no model has been built, an extensive search for helices and strands can be carried out in a reasonable amount of computing time. A partial model containing segments of the model that can be built outside the region containing the current model is then added to the current set of working models and is recombined with the other working models as in step (ii) above.
Two methods are used to attempt to build loops. One method is to identify all pairs of C-termini and N-termini of existing chains that are near each other (typically within 15 Å) and to try to extend the C-terminus of the first chain and the N-terminus of the second chain in a way that leads to at least one amino acid overlapping with a low r.m.s.d. for main-chain atoms (typically 1 Å). All such connecting segments that are found are then added as if they were another partial model to the current set of working models as in step (ii) above. The second method for building loops is to use the sequence alignment of the current best model to identify short segments that are missing from the chain and to use the above method to try to fill in the loop. This method differs from the case where the sequence alignment is unknown in that the precise number of amino acids in the loop is known. Once a set of loops has been built, a new model is created by grafting these loops onto the current best model, creating a new model with the loops built. This model is then recombined with the other working models as in step (ii) above.
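The first loop-building method starts from a list of chain ends that are close enough to be worth connecting. A minimal sketch of that search is given below; it is not the RESOLVE implementation. The function name is hypothetical, and for simplicity each chain end is represented by the Cα coordinate of its terminal residue.

```python
import math
from itertools import permutations


def candidate_loop_gaps(chains, max_gap=15.0):
    """List (c_term_chain, n_term_chain, distance) for chain ends within `max_gap`.

    chains -- dict mapping a chain name to its C-alpha coordinates ordered from
              N-terminus to C-terminus (a simplifying assumption of this sketch)
    """
    gaps = []
    for first, second in permutations(chains, 2):
        c_term = chains[first][-1]   # last residue of the first chain
        n_term = chains[second][0]   # first residue of the second chain
        distance = math.dist(c_term, n_term)
        if distance <= max_gap:
            gaps.append((first, second, distance))
    # Shortest gaps first, since they are the easiest to bridge.
    return sorted(gaps, key=lambda g: g[2])
```

Each pair returned would then be passed to the loop-extension step, which tries to grow the two ends toward each other until at least one residue overlaps.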
2.3. Rebuilding an existing model
The AutoBuild wizard has two procedures for rebuilding a model. One is to build a model from scratch exactly as described above, except recombining the best parts of the model to be rebuilt with the new model during that building process. The second procedure for rebuilding a model is quite different; this is the ‘rebuild-in-place’ procedure in which an existing model is rebuilt in segments without adding or deleting residues.
The rebuild-in-place procedure has the advantage that no parts of the model are ‘lost’ in rebuilding, but has the disadvantage that no new model is built. It is best suited to situations where the model is essentially complete and close to correct yet significant local main-chain corrections need to be made to improve the model. The rebuild-in-place procedure is based on the loop-fitting algorithms described above, combined with a procedure for the recombination of two chains that have different conformations but are aligned and have the same residues. The rebuild-in-place option is well suited to the rebuilding of high-sequence similarity models derived from molecular replacement.
In the first step of the rebuild-in-place procedure, the rebuild-in-place method in RESOLVE is used to sequentially rebuild overlapping segments of the model. A segment, typically six residues long, is removed from the model. The loop-fitting algorithm described above is then used to rebuild this segment, maintaining the identities of the residues in the loop and the length of the loop. During the loop-fitting process, the orientations of the residues at the two ends of the resulting gap are varied slightly by randomizing the coordinates of the main-chain atoms of these residues by a small distance (typically an r.m.s.d. of 0.5 Å). As the loop-residue positions are generated by extending from the last amino acid in the chain, this randomization has the effect of introducing diversity into the loops that are created. If a new loop conformation can be found, it is used to replace the existing loop. If no acceptable conformation is found, the existing loop is maintained. The process is repeated, offsetting the loop building by five residues at a time, until the entire model (except the very ends of each chain) has been rebuilt.
In the second step of the rebuild-in-place procedure, the model created by rebuilding overlapping segments is recombined with the original model, taking the best-fitting segments of each model. This crossover process is carried out by aligning the two models, identifying all the places where crossover can occur as corresponding Cα atoms that are within a small distance of each other (typically 0.5 Å) and choosing whichever model has the higher local map correlation for each segment of the model from one crossover point to the next. Once a recombined model has been obtained, side chains are rebuilt using a highly curated rotamer library (Lovell et al.) instead of the rotamer libraries originally created for RESOLVE model building (Terwilliger, 2003a).
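The two steps of rebuilding in place can be sketched as a sliding window over the chain followed by a segment-by-segment choice between the original and rebuilt models. This is an illustration only: the function names are hypothetical, one residue at each chain end is assumed to act as a fixed anchor, and the local map correlation is assumed to be supplied as one value per residue.

```python
def rebuild_windows(n_residues, length=6, offset=5):
    """Return (start, end) index pairs (end exclusive) of segments to rebuild.

    Windows are `length` residues long and start every `offset` residues, so
    consecutive windows overlap; the first and last residues are left untouched.
    """
    first, last = 1, n_residues - 1
    windows = []
    start = first
    while start < last:
        windows.append((start, min(start + length, last)))
        start += offset
    return windows


def recombine(cc_original, cc_rebuilt, crossovers):
    """Choose, for each stretch between crossover points, the model with the higher
    mean local map correlation. Returns one label per residue."""
    n = len(cc_original)
    if n == 0:
        return []
    bounds = [0] + sorted(c for c in set(crossovers) if 0 < c < n) + [n]
    choice = []
    for lo, hi in zip(bounds, bounds[1:]):
        mean_original = sum(cc_original[lo:hi]) / (hi - lo)
        mean_rebuilt = sum(cc_rebuilt[lo:hi]) / (hi - lo)
        label = "rebuilt" if mean_rebuilt > mean_original else "original"
        choice.extend([label] * (hi - lo))
    return choice
```

For a 20-residue chain the windows are (1, 7), (6, 12), (11, 17) and (16, 19). In `recombine`, ties are resolved in favor of the original model, which is a design choice of this sketch.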
2.4. Refinement with phenix.refine
A complete description of the phenix.refine program will be published elsewhere; here, we outline the features used by the AutoBuild wizard in automated model-building procedures. Depending on the quality of the initial electron-density map, the models undergoing refinement may be quite incomplete and contain significant coordinate and/or displacement parameter errors. Therefore, the methods described here have been designed to be fault-tolerant, a necessary requirement for an automated procedure. Firstly, a robust automatic bulk-solvent correction and anisotropic scaling procedure is used to account for the scattering from disordered solvent in the crystal and to correct for any anisotropic diffraction (Afonine et al.). Coordinate refinement is performed by LBFGS minimization (Liu & Nocedal, 1989) of the target function $E_{xyz}$ with respect to atomic coordinates, while keeping all other parameters fixed. $E_{xyz}$ can be a least-squares target (LS; Afonine et al.), an amplitude-based maximum-likelihood target (ML; Afonine et al.) or a phased maximum-likelihood target (MLHL; Pannu et al.). In the refinement of atomic displacement parameters (ADPs), the target $E_{\mathrm{ADP}}$ is minimized with respect to isotropic ADPs while all other model parameters are fixed. $E_{\mathrm{ADP}}$ combines the crystallographic X-ray target with an ADP restraint term that takes the form of a double sum over atom pairs: the outer sum runs over the total number of atoms in the model, the inner sum is extended over all $M$ atoms in the sphere of radius $R$ around atom $i$, each term depends on the distance between two atoms $i$ and $j$ and on their corresponding isotropic ADPs, and $k$ is a user-defined constant. By default, $R$ and $k$ are fixed at 5.0 Å and 1.0, respectively, but they can also be refined. The restraint term is scaled to the crystallographic X-ray target by comparison of the X-ray and ADP restraint gradients (Afonine et al.). This ADP restraint function makes use of the following ideas.
- (i) A bond is almost rigid; therefore, the ADPs of bonded atoms are similar (Hirshfeld, 1976).
- (ii) The ADPs of spatially close (nonbonded) atoms are similar (Schneider, 1996).
- (iii) The bond rigidity and therefore the difference between the ADPs of bonded atoms is related to the absolute values of the ADPs. Atoms with higher ADPs can have larger differences (Ian Tickle, CCP4 Bulletin Board).
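To make the three ideas concrete, the sketch below evaluates one simple restraint of this general kind. The functional form is an assumption chosen only to reflect ideas (i) to (iii) and should not be read as the expression used by phenix.refine: each pair of atoms closer than the radius contributes the squared ADP difference, weighted by the inverse of the interatomic distance and divided by the sum of the two ADPs raised to the power k. The function name is hypothetical.

```python
import math


def adp_similarity_restraint(coords, b_iso, radius=5.0, k=1.0):
    """Illustrative ADP restraint (assumed form, not the published equation).

    coords -- list of (x, y, z) atomic positions in angstroms
    b_iso  -- list of positive isotropic ADPs, one per atom
    radius -- neighbors farther than this from atom i are ignored
    k      -- exponent controlling how much larger ADPs relax the restraint
    """
    total = 0.0
    for i in range(len(coords)):          # outer sum over all atoms
        for j in range(len(coords)):      # inner sum over neighbors of atom i
            if i == j:
                continue
            r_ij = math.dist(coords[i], coords[j])
            if 0.0 < r_ij <= radius:
                difference = b_iso[i] - b_iso[j]
                total += difference ** 2 / (r_ij * (b_iso[i] + b_iso[j]) ** k)
    return total
```

Because the inner sum is evaluated around every atom, each pair is counted twice. The restraint is zero when all neighboring atoms share the same ADP and grows as nearby atoms diverge, more slowly when the ADPs themselves are large.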
We have implemented a completely automated protocol for updating the ordered solvent model during the refinement process. If requested by the user (and by default in the AutoBuild wizard), waters are updated (added and removed) in each macro cycle. In the same macro cycle, the complete structure including the updated water structure is subjected to coordinate and ADP refinement. Updating the ordered solvent model involves the following steps.
- (i) Elimination of waters present in the initial model based on user-defined cutoff criteria by ADP, occupancy and inter-atomic distances (water–water, macromolecule–water).
- (ii) Location of peaks in a cross-validated likelihood-weighted difference map (Read, 1986; Urzhumtsev et al., 1996).
- (iii) Confirmation of peaks found in the previous step using a $2mF_{\mathrm{obs}} - DF_{\mathrm{model}}$ map.
- (iv) Elimination of peaks in regions occupied by the macromolecule as determined by the current bulk-solvent mask.
- (v) Elimination of peaks that are too close to each other (the default minimum distance is 2.0 Å; the strongest peak of two close peaks is retained).
- (vi) Elimination of peaks that are too close to macromolecular atoms (the default minimum distance is 1.8 Å).
- (vii) Elimination of peaks that are too far away from macromolecular atoms (the default maximum distance is 6.0 Å).
- (viii) Elimination of peaks based on the evaluation of tabulated empirical distance distributions derived from the analysis of high-resolution models in the PDB (Afonine et al., 2005a). Distance distributions between water O atoms and macromolecular C, N and O atoms are tabulated. Only peaks with a good fit to at least one distance distribution are retained.
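The purely geometric filters in this list, steps (v) to (vii), can be illustrated with a short function. This is a sketch and not the phenix.refine implementation: the function name and the data layout (peaks as height and position pairs) are hypothetical, while the default distances are those quoted in the list.

```python
import math


def filter_water_peaks(peaks, macro_atoms, min_peak_sep=2.0, min_dist=1.8, max_dist=6.0):
    """Apply the distance-based peak filters in the order they are listed.

    peaks       -- list of (height, (x, y, z)) difference-map peaks
    macro_atoms -- list of (x, y, z) macromolecular atom positions
    """
    if not macro_atoms:
        return []
    # Step (v): of two peaks closer than min_peak_sep, keep the stronger one.
    separated = []
    for height, xyz in sorted(peaks, key=lambda p: p[0], reverse=True):
        if any(math.dist(xyz, kept_xyz) < min_peak_sep for _, kept_xyz in separated):
            continue
        separated.append((height, xyz))
    # Steps (vi) and (vii): the nearest macromolecular atom must lie
    # between min_dist and max_dist from the peak.
    accepted = []
    for height, xyz in separated:
        nearest = min(math.dist(xyz, atom) for atom in macro_atoms)
        if min_dist <= nearest <= max_dist:
            accepted.append((height, xyz))
    return accepted
```

The function returns the surviving peaks sorted from strongest to weakest; the remaining steps (map confirmation, mask check and empirical distance distributions) need map and database information and are not covered here.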
It is not uncommon for macromolecular crystal structures to have more than one copy of a molecule in the asymmetric unit, generating some form of noncrystallographic symmetry (NCS). This symmetry is exploited in the model-building procedure and can also be used in the refinement of the structures in phenix.refine. Briefly, the sequence of the input model is subjected to pairwise sequence alignment (Needleman & Wunsch, 1970; Smith & Waterman, 1981) to identify similar molecules in the model. If any relationships are found, least-squares superposition of the structures is performed (Kearsley, 1989) and the coordinate deviation is calculated. If the root-mean-square deviation between the coordinates is below a user-specified tolerance (default 3.0 Å), NCS restraints are applied to the related coordinates during structure refinement. The default NCS restraints are very tight (0.1 Å r.m.s.d. for both main-chain and side-chain NCS-related pairs).
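The NCS detection logic can be sketched as a global sequence alignment followed by an r.m.s.d. test on the aligned residues. The code below is a simplified illustration, not the phenix.refine implementation: the scoring values (match 1, mismatch and gap -1) and function names are assumptions, and the least-squares superposition step is omitted, so the two coordinate sets are assumed to have been superposed already.

```python
import math


def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns the list of aligned index pairs (i, j)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def ncs_related(seq_a, xyz_a, seq_b, xyz_b, tolerance=3.0):
    """Return (related, rmsd) for two chains whose coordinates are pre-superposed."""
    pairs = needleman_wunsch(seq_a, seq_b)
    if not pairs:
        return False, None
    sum_sq = sum(math.dist(xyz_a[i], xyz_b[j]) ** 2 for i, j in pairs)
    rmsd = math.sqrt(sum_sq / len(pairs))
    return rmsd < tolerance, rmsd
```

Chains that pass this test would then be tied together with the tight NCS restraints described above.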
2.5. Iterative model building beginning from an experimental map
The AutoBuild wizard has one procedure for initial iterative model building beginning from an experimental electron-density map and a second procedure for iterative rebuilding and completion of an initial model. These procedures are based in part on the ‘build’ and ‘rebuild’ procedures in the RESOLVE model-building script (Terwilliger, 2003b), although they contain additional steps, as described above.
The procedure for model building from an experimental electron-density map consists of cycles of two basic steps. These steps are (i) using experimental phase information and any additional phase information available in statistical density modification (Terwilliger, 2000) to create a new working electron-density map and (ii) building and refining a model based on this new map as described in the preceding sections.
For density modification, several sources of additional phase information are used when available. One is any noncrystallographic symmetry (NCS) information as implemented in RESOLVE. NCS is deduced from the coordinates of heavy-atom sites if available and also directly from the current atomic model of the macromolecule as described above if the sequence has been aligned. A second source of information is the presence of recognizable local patterns of density in the electron-density map (Terwilliger, 2003c) and a third is the presence of density matching a helical or strand template in the map (Terwilliger, 2001). A fourth source of information consists of any partial models of the macromolecule that have been built. For the purpose of identifying patterns in the electron-density map, a composite omit map is produced each cycle in which model information is excluded from the omitted region (Terwilliger, 2003b).
The approach used to carry out density modification in this ‘build’ procedure has several steps. Firstly, electron-density information from local patterns of density and from helical and strand locations is combined. Both the identification of local patterns of density and the identification of helical and strand fragments result in a pseudo-electron-density map that carries some information about the true electron-density map. Relative weights for these maps are chosen such that the weighted average pseudo-electron-density map has the highest possible correlation with the current working map. The resulting pseudo-electron-density map is then used as a target for statistical density modification in the same way that NCS and model-based information is incorporated, except that the uncertainty associated with this target map is arbitrarily set to a very high value (typically the r.m.s.d. of the current electron-density map) so as not to overly emphasize this information (Terwilliger, 2003b).
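The choice of relative weights for the two pseudo-electron-density maps can be illustrated with a one-parameter search. The sketch below is an assumption about how such a weight might be found, not the RESOLVE procedure: maps are represented as flat lists of density values on a common grid, the weighted average is w·A + (1 − w)·B, and w is scanned on a regular grid to maximize the Pearson correlation with the working map. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_relative_weight(map_a, map_b, working, steps=100):
    """Scan w in [0, 1] and return (w, correlation) for the best weighted average."""
    best_w, best_cc = 0.0, -2.0
    for s in range(steps + 1):
        w = s / steps
        combined = [w * a + (1.0 - w) * b for a, b in zip(map_a, map_b)]
        cc = pearson(combined, working)
        if cc > best_cc:
            best_w, best_cc = w, cc
    return best_w, best_cc
```

When the working map resembles only one of the two pseudo-maps, the search places all of the weight on that map; when it resembles both equally, the weight settles near 0.5.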
The phase probability distributions obtained are then used as prior phase information in a second density-modification step that includes model information as well as any NCS and solvent-flattening information (Terwilliger, 2003b). The models obtained in any previous cycles are used to calculate a composite target model map (Terwilliger, 2003b) and the target model map is scaled to match the working map as closely as possible, including only grid points near the positions of atoms in the model (typically within 2.5 Å). The r.m.s.d. between the working map and this target map at these grid points is used as the uncertainty for the values in the target map in statistical density modification (Terwilliger, 2003b). The map obtained from this statistical density-modification procedure is then used for model building.
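The scaling of the target model map and the estimate of its uncertainty can be sketched as follows. This is an illustration under assumptions rather than the RESOLVE code: maps are flat lists of density values on a common grid, the mask of grid points near atoms (typically within 2.5 Å) is assumed to be precomputed, and "as closely as possible" is interpreted as a single least-squares scale factor with no offset. The function name is hypothetical.

```python
import math


def scale_target_map(target, working, near_atom):
    """Return (scale, rmsd) for the target map against the working map.

    target, working -- density values at the same grid points
    near_atom       -- booleans, True where a grid point lies close to a model atom
    """
    pairs = [(t, w) for t, w, use in zip(target, working, near_atom) if use]
    if not pairs:
        raise ValueError("no grid points near the model")
    sum_tt = sum(t * t for t, _ in pairs)
    sum_tw = sum(t * w for t, w in pairs)
    # Least-squares scale factor minimizing sum (scale * t - w)^2.
    scale = sum_tw / sum_tt if sum_tt > 0 else 0.0
    rmsd = math.sqrt(sum((scale * t - w) ** 2 for t, w in pairs) / len(pairs))
    return scale, rmsd
```

The returned r.m.s.d. plays the role of the uncertainty assigned to the target map: the worse the scaled model map agrees with the working map near the atoms, the less weight the model information carries in density modification.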
2.6. Iterative density modification, model building and refinement beginning from a model
The AutoBuild wizard procedure for iterative model building beginning from a partial model is similar to the procedure starting from experimental phase information, but there are differences resulting from the fact that the phases available are from a partial model. Owing to model bias, the methods for identifying local patterns of density and for finding helices and strands used when starting from experimental phase information are not effective and are therefore skipped. Additionally, the starting map used in the final density-modification step comes from the model and not from experimental phases.
The procedure for density modification beginning from a model uses model-based phase probabilities as the starting point for density modification. A composite target map is calculated from any models available from previous cycles, just as in the procedure described in §2.4. This map is then used as a target for statistical density modification, using the same procedure for calculating uncertainties in the target density that was used for the incorporation of pattern-based density information in the same section (that is, simply using the r.m.s.d. of the map as the uncertainty). The resulting phases and map are then used in statistical density modification including NCS and solvent flattening, yielding a density-modified model-based electron-density map. This map is then used as a starting point for density modification that includes model information as well as any NCS and solvent-flattening information as described in §2.4. The prior phase probabilities for this density-modification step consist only of any experimental phase information that is available (so there is no prior phase information in cases where the rebuilding is performed with no experimental phase information).