A library of real protein fragments of arbitrary length is employed to interpret electron density and correct existing models. In order to support both interactive graphical model building (where users demand immediate feedback) and automated model building (where many possible model fragments may need to be tested to match a particular feature), it must be possible to perform a very rapid search for fragments containing some atoms matching a desired conformation.
For example, to fit the main-chain atoms to a Cα trace the database will be searched for all six-peptide fragments matching the Cα atoms surrounding a particular peptide bond and the peptide atoms from the middle peptide of the best-fitting fragment will be used to provide the main-chain atoms for that peptide group. Similarly, to build a missing loop in a protein structure a search will be performed for all fragments for which the initial and final pairs of Cα atoms in the fragment may be superimposed on the last two Cα atoms before the break and the first two Cα atoms after the break.
A library has therefore been constructed using the 500 well refined protein structures of the Richardsons’ ‘Top 50’ database (Lovell et al.
), excluding residues for which the temperature factors of the Cα
atoms exceed 40 Å2
. This provides a database of 106 295 amino acids in 1327 continuous fragments. For each amino acid, the residue type and the coordinates of the N, Cα
and C atoms are stored (in turn providing sufficient information to locate the Cβ
and O atoms). The entire database is stored as a single list of amino-acid records.
The most frequent type of search which will be performed on the database is to find all fragments for which some (possibly discontinuous) set of Cα atoms superpose well on the Cα atoms of some search fragment. The search fragment is in turn provided as a list of amino-acid records, with null records inserted as placeholders to represent residues for which the location is unknown. Thus, to search for a missing loop of four residues, an eight-residue search fragment is constructed from the two residues before the missing loop, four null residues and the two residues after the missing loop.
Performing a least-squares superposition for every fragment in the database would be computationally demanding, so an initial pre-selection phase is performed to produce a subset of fragments which may be good matches to the search fragment. This pre-selection involves a computationally cheaper distance-matrix score.
In order to minimize the computational overhead, distance matrices for the search fragment and for the database are precalculated. For the search fragment, a triangular matrix is calculated with the first row giving the distances from the first Cα to the remaining n − 1, the second row the distances from the second Cα to the remaining n − 2 and so on. The columns of this matrix correspond to the diagonals of the upper triangle of a conventional distance matrix (illustrated in Fig. 1). If an atom is missing, the distance is set to a negative flag value.
Running distance-matrix representation of a single fragment, where D
ij is the distance between the ith and jth Cα atoms. The shaded cells are those available for loop fitting using only two Cα atoms at each end of the fragment.
For the database of n
db residues, an n
db × 20 rectangular ‘running distance matrix’ is pre-calculated, with each row giving distances from the first Cα to the following 20, thus representing fragments of up to 21 residues. This is illustrated in Fig. 2 for a reduced width of six residues. Any distances which span chain boundaries are set to the flag value.
Figure 2 Running distance-matrix representation of the protein-chain database, where D
i,j is the distance between the ith and jth Cα atoms. The shaded cells are those which would be used to score the fit of a search fragment against a particular range (more ...)
In order to identify a set of possibly matching fragments, all that needs to be done is to compare the non-missing values in the fragment distance matrix to the corresponding values obtained by starting from each row of the database distance matrix in turn. A sum of squared differences is used to identify likely matches.
To further optimize the calculation, the sum-of-squares calculation may be terminated early as soon as the sum exceeds a threshold value. The threshold value is controlled by a parameter which determines how many matches will be returned and is updated regularly by sorting the current list of matches, truncating to the desired number and setting the threshold to the value of the worst remaining match.
The limitation of the distance-matrix score is that the distance matrix of a set of coordinates is invariant under inversion of these coordinates through a centre of symmetry, and so the initial search also returns fragments which are the inverse of the search fragment. The resulting list of candidate fragments must therefore be re-scored using a full least-squares superposition and r.m.s. difference calculation. The resulting list is resorted according to the r.m.s. difference.
For some purposes it may be desirable to restrict the search to fragments for which the sequence obeys some criterion, for example to take into account the different main-chain conformations which can occur around Gly or Pro. This is achieved by allowing a mask of 20 binary digits to be set for each position in the search fragment, indicating which of the 20 amino-acid types are allowed to appear at that position in the fragment. This provides an additional restriction on the search results which may be evaluated by simple logical operations.