Our implementation of the substructure search avoids the use of any SI algorithms, as tolerance to variations in atom identities and interatomic distances is not readily compatible with these algorithms. Instead, we employ an iterated sorting and filtering scheme, which first scans the target structure and collects those atom pairs for which the query structure contains an equivalent pair of atoms at a matching distance. These candidate pairs are then used to construct candidate substructures, and the best matching substructures are selected based on their weights. The weight $W$ of a substructure is defined as the geometric mean of the weights $W_i$ of all $N$ atom pairs in the substructure, multiplied by an additional penalty $(1 - w_j)$, where $w_j$ is the weight of a missing atom, to account for up to $M$ such atoms:

$$W = \left( \prod_{i=1}^{N} W_i \right)^{1/N} \prod_{j=1}^{M} (1 - w_j)$$
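As an illustration of the weight definition above, the following minimal Python sketch computes a substructure weight from a list of pair weights and a list of missing-atom weights. The function name `substructure_weight`, the assumption that all weights lie between 0 and 1, and the log-space evaluation of the geometric mean are choices made for this example only; they are not taken from the Erebus implementation, and the way individual pair weights are obtained is not specified here.

```python
import math
from typing import Sequence


def substructure_weight(pair_weights: Sequence[float],
                        missing_atom_weights: Sequence[float] = ()) -> float:
    """Geometric mean of the pair weights W_i times the penalties (1 - w_j).

    Hypothetical helper: weights are assumed to lie in [0, 1].
    """
    if not pair_weights:
        raise ValueError("at least one atom pair is required")
    # A zero pair weight makes the geometric mean zero (and log undefined).
    if any(w <= 0.0 for w in pair_weights):
        return 0.0
    # Geometric mean evaluated in log space to avoid underflow for large N.
    log_mean = sum(math.log(w) for w in pair_weights) / len(pair_weights)
    # One factor (1 - w_j) per missing atom; an empty list gives a penalty of 1.
    penalty = math.prod(1.0 - w for w in missing_atom_weights)
    return math.exp(log_mean) * penalty
```

For example, two pairs with weights 0.25 and 1.0 have a geometric mean of 0.5; one missing atom of weight 0.2 reduces the substructure weight to 0.4.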
The tolerance to deviations in atom pair distances is an important requirement, as the available crystal structures have limited resolution and the conformation of a binding pocket undergoes constant thermal fluctuation. Available computational power imposes a practical limit on the tolerance that can be handled by our algorithm; a higher tolerance results in a larger number of matches, and therefore an increased structure processing time. We show that we can accurately locate the binding pockets for small molecules (drugs, poisons and ADP), scaffolds for metal ions and, in certain cases, the binding sites of short peptides. To illustrate these results, we have prepared a test query structure using the following seven atoms of the tetracycline binding pocket from tetracycline repressor (PDB ID 1BJ0): H64 NE2, N82 ND2, N82 OD1, F86 CE1, F86 CE2, H100 NE2 and Q116 NE2. We perform a search for this substructure with a tolerance of σ = 2 Å on the entire PDB, current as of October 12, 2010, which at this time contains a total of 20 tetracycline repressor structures and 68 000 protein structures (2 × 10⁵ models) overall. Our method locates 18 of these 20 structures; notably, not all of the located structures have tetracycline bound. The two tetracycline repressor structures that we did not identify (2NS7 and 2NS8) have active site conformations very different from those of other tetracycline repressors, most likely caused by mutations introduced in residues near the binding site.
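The trade-off between tolerance and the number of matches can be made concrete with a small Python sketch of the first step of the search, the collection of candidate atom pairs. This is a simplified illustration under stated assumptions, not the Erebus code: the function name `candidate_pairs` is hypothetical, atoms are represented as `(label, (x, y, z))` tuples with coordinates in Å, atom identities are compared by exact label equality (the method described here also tolerates variations in atom identity), and the tolerance is applied as a hard cutoff on the distance difference.

```python
import math
from itertools import combinations, permutations


def candidate_pairs(query, target, tolerance):
    """Collect target atom pairs that match a query atom pair.

    A target pair matches when both atom labels agree and the
    interatomic distances differ by at most `tolerance` (in Å).
    Returns a list of ((qi, qj), (ti, tj)) index tuples.
    """
    matches = []
    for qi, qj in combinations(range(len(query)), 2):
        q_dist = math.dist(query[qi][1], query[qj][1])
        # Ordered target pairs, so that either orientation can match.
        for ti, tj in permutations(range(len(target)), 2):
            if query[qi][0] != target[ti][0] or query[qj][0] != target[tj][0]:
                continue
            t_dist = math.dist(target[ti][1], target[tj][1])
            if abs(t_dist - q_dist) <= tolerance:
                matches.append(((qi, qj), (ti, tj)))
    return matches


# Toy data (invented coordinates, not taken from any PDB entry).
query = [("N", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
target = [
    ("N", (0.0, 0.0, 0.0)),
    ("O", (3.5, 0.0, 0.0)),
    ("O", (6.0, 0.0, 0.0)),
    ("C", (1.0, 0.0, 0.0)),
]
tight = candidate_pairs(query, target, tolerance=0.6)
loose = candidate_pairs(query, target, tolerance=3.0)
```

In this toy example the tighter tolerance keeps one candidate pair and the looser tolerance keeps two. Every additional candidate pair enlarges the set of candidate substructures that must be assembled and weighted, which is why processing time grows with tolerance.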
Finally, we point out that our method is not limited to the detection of surface features, and can also detect buried substructures. By providing information on proteins containing specified atoms and residues in the given spatial arrangement, Erebus may serve as the first step in many structure analysis protocols.
Funding: National Institutes of Health (grant numbers R01GM080742, ARRA supplements GM080742-03S1 and GM066940-06S1 to N.V.D.).
Conflict of Interest: none declared.