Modelling data set
In order to develop and test the approach, twenty-four families were chosen from the HOMSTRAD database [23], representing each of the four main SCOP classes (all α, all β, α + β and α/β). For each family, five members were chosen so as to maximise the range of percentage sequence identity (PID, calculated by Malform [35]) while ensuring that all the solved structures were of relatively high resolution (better than 2 Å). One member was designated as the target, with the rest acting as the templates. This allowed fifteen combinations of the templates, exploiting one to four homologues as templates, so reflecting information from homologues across the range of PID. The data set was sub-divided into three. The first set consisted of four families that were used to define the default parameters. The restraint defaults for main chain and side chain restraint sphere size were chosen by iteratively reducing the radii in a combinatorial manner until RAPPER was unable to generate a model. The second set, comprising a further six families, was used to generate all 15 combinations of template to target. The third set comprised all of the chosen families and was used to test alternative approaches to defining restraints. Table shows the families and their constituent members. The possible combinations of target to templates are given in Table . Each of the combinations, including the target, was structurally aligned using COMPARER [36] and annotated by JOY [37]. The resulting alignments were manually corrected to give the best possible alignment, thus minimising any error from an incorrect alignment.
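The count of fifteen follows from taking every non-empty subset of the four templates (4 + 6 + 4 + 1). A minimal sketch of that enumeration is given below; the template labels are hypothetical placeholders, not HOMSTRAD identifiers.

```python
from itertools import combinations


def template_combinations(templates):
    """Return every non-empty subset of the templates, smallest subsets first."""
    combos = []
    for size in range(1, len(templates) + 1):
        combos.extend(combinations(templates, size))
    return combos


# Hypothetical labels for the four non-target members of one family.
templates = ["t1", "t2", "t3", "t4"]
combos = template_combinations(templates)
print(len(combos))  # 4 + 6 + 4 + 1 = 15
```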
Modelling procedure for RAPPER
The application of the conformational search engine RAPPER to comparative modelling by satisfaction of spatial restraints was achieved by extending the restraint engine described for solving the Cα trace problem [19]. From the given alignment, a structural superimposition of equivalent residues is made and optimised. A common core is defined from the set of aligned protein structures as the subset of equivalent residue atoms with relatively little structural variation, as determined by the Altman-Gerstein algorithm [38] implemented in RAPPER. Based on this superimposition and alignment, spatial restraints can then be described for each residue of the target sequence. There are four types of spatial restraint:
1 – As RAPPER builds from the N-terminus to the C-terminus, a bootstrap restraint is required to allow modelling to commence. The bootstrap is defined as the mean position of the Cβ coordinates from the templates, which is made the centre of a restraint sphere, the size of which is user-defined. In building the first two residues, the position of the first residue Cβ is taken at a random offset from the mean position of the equivalent Cβ atoms of the templates. From this, the remaining backbone atom positions can be calculated from the ideal Engh and Huber [39] bond angles and lengths implicit in the RAPPER protein model. A ψ angle is then randomly picked from fine-grained residue-specific ϕ/ψ propensity tables, as well as a random angle for the vector between the first and the second Cβ positions. Thus the first peptide bond is generated.
2 – A set of spatial restraints is defined for the backbone (main chain) atoms, principally the Cα atoms. Each is defined as an ellipsoid generated from the union of the set of restraint spheres centred on the equivalent atom position from each of the templates, as defined in equation 1. The size of these spheres is user-defined.
\[ \lVert \mathbf{x}_{C\alpha} - \mathbf{c} \rVert \le r \qquad (1) \]
where $\mathbf{x}_{C\alpha}$ is the position of the Cα atom and $\mathbf{c}$ is the centre of the restraint sphere with radius $r$.
3 – A similar set of spherical restraints can be defined for the side chain atoms, except that, rather than taking each atom separately, a virtual centroid of the side chain (as defined in equation 2) is calculated and this position is used to centre the restraint sphere. In fact, two virtual centroid positions are calculated: a short virtual centroid, which essentially takes into account the atoms up to and including the Cγ position, and a long virtual centroid, which accounts for the rest of the side chain.
\[ \mathbf{x}_{\mathrm{centroid}} = \frac{1}{N_{sc}} \sum_{j=1}^{N_{sc}} \mathbf{x}_j \qquad (2) \]
where $N_{sc}$ is the number of side chain atoms and $\mathbf{x}_j$ is the position of side chain atom $j$.
4 – A set of spatial restraints is derived for secondary structure elements. Residues are defined to be in elements of secondary structure from consideration of the consensus across the template structures or from secondary structure prediction. The restraints are a combination of restricted ϕ/ψ sampling of the residue specific ϕ/ψ propensity tables to the alpha helical or beta sheet regions of ϕ/ψ space and short range hydrogen bonding distance restraints. Only short range hydrogen bonding is enforced and this primarily in alpha helical regions, although we have now developed algorithms for including more long range restraints (A Karmali and N Furnham, unpublished data).
As well as the specific restraints from homologues, a number of other restraints are also enforced including clash restraints against the framework structure as it is built and distance restraints from ideal bond angles, bond lengths and omega torsion angles. All of the restraints can be propagated along the chain for a user defined distance.
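The sphere-based restraints of types 2 and 3 can be illustrated with a short sketch. It is not RAPPER's implementation: the function names are hypothetical, and the set of atom names treated as lying "up to and including the Cγ position" is an assumption made for illustration.

```python
import math


def within_any_sphere(point, centres, radius):
    """True if `point` lies inside at least one sphere of the given radius,
    i.e. inside the union of the spheres centred on the template atoms."""
    return any(math.dist(point, c) <= radius for c in centres)


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


# Assumption: beta and gamma position atom names count towards the short centroid.
SHORT_ATOMS = {"CB", "CG", "CG1", "CG2", "OG", "OG1", "SG"}


def side_chain_centroids(atoms):
    """Split side chain atoms {name: xyz} into short and long virtual centroids.

    Returns (short, long); an entry is None when no atoms fall in that group.
    """
    short = [xyz for name, xyz in atoms.items() if name in SHORT_ATOMS]
    long_ = [xyz for name, xyz in atoms.items() if name not in SHORT_ATOMS]
    return (centroid(short) if short else None,
            centroid(long_) if long_ else None)
```

A candidate Cα position would be accepted when `within_any_sphere(position, template_ca_positions, 2.0)` is true, with the 2 Å radius being the user-defined sphere size.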
The standard building process in RAPPER, as described previously, is used [18, 19]. Briefly, the algorithm employs a branch and bound protocol to extend the polypeptide chain iteratively in the N- to C-terminal direction. A population of 100 fragments that make up the growing polypeptide chain is maintained, with a maximum of 100,000 attempts to find the 100 solutions to the restraint network at each residue position. As some residues are in rare ϕ/ψ conformations, this may still be insufficient to sample the ϕ/ψ space effectively. Thus, to optimise the time spent searching, the target sequence is split into a number of fragments; the split points are chosen at random, except that regions where no template information is available are avoided. A population of 50 models is produced for each target. The geometric average of the model population is calculated in RAPPER. The resultant single model is then re-geometrised by TINKER [40]. The protocol is summarised in Figure .
Models were constructed using this standard comparative modelling mode. In each round of building, 2 Å spheres were enforced for the bootstrap, Cα main chain and side chain restraints. These values were determined from the subset of four families used to parameterise the modelling procedure. This parameterisation was achieved by iterative rounds of building in which each of the parameters was adjusted in a combinatorial manner, starting from a large value and gradually decreasing in 0.5 Å increments until the restraints were too strict for a model to be built. The parameters of the last round in which a model could be successfully generated were taken as optimal.
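The parameter search described above can be sketched for a single radius as follows. The `can_build` callback is a hypothetical stand-in for a full RAPPER run, and the 5 Å starting value is an assumption, since the text only specifies "a large value". The real procedure varies several radii combinatorially rather than one at a time.

```python
def smallest_buildable_radius(can_build, start=5.0, step=0.5):
    """Shrink the radius from `start` in `step` decrements.

    Returns the last radius at which `can_build(radius)` succeeded,
    or None if even the starting radius failed.
    """
    last_ok = None
    radius = start
    while radius > 0:
        if not can_build(radius):
            break  # restraints are now too strict; keep the previous radius
        last_ok = radius
        radius -= step
    return last_ok


# Toy stand-in for a modelling run: pretend building fails below 2 Å.
best = smallest_buildable_radius(lambda r: r >= 2.0)
```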
RAPPER sampling by PID
The results of modelling using all the templates demonstrate that the approach would benefit from restricting the available search area. This can be achieved simply by weighting towards the restraints derived from the template with the highest PID to the target: the size of each restraint sphere is reduced according to the PID between the template from which it is derived and the target. The range of PID across the available templates is calculated and divided into four equal sub-ranges. If the PID of the template lies in the top quartile, the user-defined restraint sphere radius is enforced. If the PID of the template lies in one of the other three quartiles, the restraint sphere is reduced by a corresponding factor, with the restraint spheres generated from a template whose PID lies in the lowest quartile being reduced by 60%. In addition, the sampling frequency of the restraint sphere generated from the template with the highest PID is enhanced.
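A minimal sketch of the PID-based scaling follows. Only the 60% reduction for the lowest sub-range is stated in the text; the 40% and 20% reductions used for the two intermediate sub-ranges are assumptions chosen to interpolate linearly, and the function name is hypothetical. The enhanced sampling frequency for the best template is not modelled here.

```python
# Fractional reduction per PID sub-range, lowest sub-range first.
# Only the 60% value comes from the text; 40% and 20% are assumed.
REDUCTIONS = (0.60, 0.40, 0.20, 0.0)


def scaled_radii(pids, radius):
    """Return one restraint sphere radius per template, scaled by PID sub-range."""
    lo, hi = min(pids), max(pids)
    span = hi - lo
    radii = []
    for pid in pids:
        if span == 0:
            band = 3  # all templates equally similar: keep the full radius
        else:
            # Map the PID onto one of four equal sub-ranges (0 = lowest).
            band = min(int(4 * (pid - lo) / span), 3)
        radii.append(radius * (1.0 - REDUCTIONS[band]))
    return radii
```

For example, four templates with PIDs of 20, 40, 60 and 80 and a 2 Å default would, under these assumed factors, receive radii of roughly 0.8, 1.2, 1.6 and 2.0 Å.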
RAPPER using probability density function derived restraints
More distantly related homologous structures can be exploited if restraints are formulated as probability density functions (PDF). The position of each atom (or centroid for side chains) can be used to centre a probability function described as a Gaussian distribution (equation 3), the mean of which is the atom position and the variance of which is set from the local PID taken over a window of 20 residues:
\[ p_i(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x - x_i)^2}{2\sigma_i^2} \right) \qquad (3) \]
where $i$ is the position in the template sequence, $x_i$ is the Cα position of the template and $\sigma_i^2$ is inversely proportional to the PID of the template. The sum of the distributions of each of the homologous atom positions is calculated and normalised to generate a PDF (equation 4):
\[ P(x) = \frac{\sum_t p_t(x)}{\int \sum_t p_t(x')\,\mathrm{d}x'} \qquad (4) \]
where $x$ is the coordinate in question, $t$ is the template and $p_t$ is the distribution of equation 3 for template $t$. This is done for each of the x, y and z coordinates. The resulting mean position of the combined PDF is taken as the centre of the restraint sphere, the radius of which can either be user-defined or defined by the standard deviation of the new distribution for each coordinate, which can then be used to define an ellipsoid (see Figure ).
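The mean and standard deviation of a normalised sum of Gaussians have a closed form, so the restraint centre and per-axis spread can be computed without numerical integration. The sketch below assumes equal weighting of the templates and a variance of `scale / PID`; the `scale` constant and the function names are hypothetical, as the text states only that the variance is inversely proportional to the PID.

```python
import math


def combine_gaussians(means, variances):
    """Mean and standard deviation of an equal-weight mixture of 1-D Gaussians."""
    n = len(means)
    mean = sum(means) / n
    # Second moment of each component is variance + mean^2.
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    variance = max(second_moment - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)


def restraint_from_templates(positions, pids, scale=100.0):
    """positions: one (x, y, z) per template; pids: local PID (percent) per template.

    Returns (centre, sigmas), each a 3-tuple, treating x, y and z independently.
    Assumption: variance = scale / PID, with `scale` a hypothetical constant.
    """
    variances = [scale / pid for pid in pids]
    per_axis = [combine_gaussians([p[k] for p in positions], variances)
                for k in range(3)]
    centre = tuple(m for m, _ in per_axis)
    sigmas = tuple(s for _, s in per_axis)
    return centre, sigmas
```

Under equal weighting the centre is simply the average of the template positions, so the PID influences only the spread; the three per-axis standard deviations could serve as the semi-axes of the ellipsoid.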
RAPPER using CHORAL/ANDANTE predictions
An alternative approach to defining restraints based upon information from homologous structures takes advantage of the predictions of two programs: CHORAL [31] and ANDANTE [32]. CHORAL, an amalgam of differential geometry and pattern recognition algorithms, identifies the clusters of conformers from homologous templates with conserved curvature and torsion that are most likely to represent the core backbone of the target structure. ANDANTE uses environment-specific substitution probabilities to predict where χ1, χ1 plus χ2, or χ1 plus χ2 plus χ3 can be used directly from a single template to limit the rotamer search space. Thus, RAPPER uses the equivalent template residue(s) predicted to contribute either to the target's core backbone or to its side chain conformations to generate the restraint network. For example, if CHORAL predicts that residue i in the target sequence will have a backbone conformation similar to that of the equivalent residues of template 1 and template 2, the Cα atoms of these two templates are used as the centres of the main chain restraint spheres. Similarly, where ANDANTE predicts that the χ1 plus χ2 of template 2 is most likely to be conserved in the target, the short virtual centroid position is used as the centre of the short side chain restraint sphere. RAPPER then builds through this restraint network in the same manner as in the standard method for restraint derivation.
For each target, the protocol of the standard comparative modelling procedure is used to produce an ensemble of 50 models; the arithmetic mean is taken and the structure re-geometrised using TINKER [40]. Using CHORAL/ANDANTE predictions allowed tighter restraints of 1 Å radius to be universally enforced for both main chains and side chains. Where CHORAL or ANDANTE did not predict conformations for a residue (i.e. in a variable loop region, or where there was no prediction of side chain rotamer), all of the templates were used to generate the restraint network with the larger 2 Å radius. At the interface between the conserved core and a non-conserved region, the backbone restraint sphere radius was "funnelled": it gradually increased from 1 Å to 2 Å at the end of a conserved core region and gradually decreased from 2 Å to 1 Å at the beginning of the next conserved core region. This provided continuity in the main chain restraint network, ensuring that no unrealistic distances were required to be satisfied.
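The funnelling of the backbone radius can be sketched as a linear ramp over the last few core residues on either side of a non-conserved region. The ramp length of three residues and the linear form are assumptions; the text specifies only that the radius changes gradually between 1 Å and 2 Å.

```python
def funnelled_radii(is_core, tight=1.0, loose=2.0, ramp=3):
    """Per-residue main chain radius.

    is_core: list of booleans, True where the residue is in the conserved core.
    Core residues within `ramp` positions of a non-core residue get a radius
    interpolated linearly between `tight` and `loose` (assumed ramp shape).
    """
    n = len(is_core)
    radii = []
    for i, core in enumerate(is_core):
        if not core:
            radii.append(loose)
            continue
        # Distance in residues to the nearest non-core position, if any.
        gaps = [abs(i - j) for j in range(n) if not is_core[j]]
        d = min(gaps) if gaps else ramp + 1
        steps_in = max(ramp - d + 1, 0)
        radii.append(tight + (loose - tight) * steps_in / (ramp + 1))
    return radii


# Two five-residue core regions separated by a two-residue loop.
profile = funnelled_radii([True] * 5 + [False] * 2 + [True] * 5)
```

With these assumed settings the radius rises 1.25, 1.5, 1.75 Å over the last three core residues before the loop and falls symmetrically after it.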
Baseline Modelling
In addition to the basic comparative mode of RAPPER, further models were constructed in order to estimate the limitations of the method. For example, we used the Cα trace mode of RAPPER [19] to rebuild the target based on experimentally observed co-ordinates. We also exploited restraints from secondary structure information, using the actual positions of the Cα atoms of the experimentally resolved target to define the restraint network. Alternatively, restraints were defined from the template whose Cα was closest to that of the target, while ensuring that this position was consistent with the previous restraint sphere centre, i.e. separated from it by approximately one Cα-Cα bond length.
Other modelling programs
The targets were also built using the well-established comparative modelling program MODELLER [41]. Ten models were produced by MODELLER using the standard model-building routine. A single model was automatically selected based on the average between the minimal energy as calculated by MODELLER and minimal steric violations.