The number of publicly available protein structures in the Research Collaboratory for Structural Bioinformatics (RCSB) database has grown to more
than 30,000, with thousands more being added each year. In addition, the number of small-molecule structures available
in public and proprietary databases has reached into the millions. This wealth of available data raises the question of how it
can best be used to assist in drug design and discovery. Moreover, finding novel leads for a new drug target has become
one of the most important and crucial steps in a drug development program. Researchers currently follow mainly
two complementary strategies: 1) experimental high-throughput screening to discover possible leads in large compound libraries, and
2) computational methods that exploit structural information about the protein binding site to discover new leads by virtual screening of
large databases [1, 2, 3, 4]. Virtual screening approaches try
to predict the actual binding mode of a ligand at the binding site by scoring each candidate binding mode generated through docking. In silico
virtual screening is useful because it is fast enough to scan several hundred to several thousand compounds [5, 6]. Through virtual screening
we can rank the possible modes of ligand binding and can also predict the Gibbs free energy of binding, provided that the structure
of the receptor is known and the scoring function is sufficiently accurate. The performance of such methods is usually determined
by assessing whether the binding geometry of protein-ligand complexes resolved by X-ray crystallography or NMR is reproduced. This
validation criterion imposes some preconditions on the methods being developed, because the available protein-ligand complexes
have limited resolution.
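As an illustration of this validation criterion, the following minimal sketch compares a docked pose with the experimentally resolved pose. It assumes that the comparison is made with the heavy-atom root-mean-square deviation (RMSD) and that a pose counts as reproduced when the RMSD is at most 2.0 Å; both the metric and the threshold are common conventions assumed here, not choices stated in the text. The function names `rmsd` and `pose_is_reproduced` are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) atoms.

    No superposition is performed: both poses are assumed to be expressed
    in the coordinate frame of the same receptor structure.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def pose_is_reproduced(docked, crystal, threshold=2.0):
    """Return True if the docked pose lies within `threshold` angstroms of the crystal pose.

    The 2.0 angstrom default is an assumed convention, not a value from the text.
    """
    return rmsd(docked, crystal) <= threshold
```

In this sketch a scoring function would be judged by how often its top-ranked pose satisfies `pose_is_reproduced` over a set of reference complexes.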
There are two broad categories of scoring functions. Functions in the first category are largely based on aspects of the known
physics of molecular interaction, such as van der Waals forces, electrostatics, and bending and torsional forces, which are used to determine
the energy of a particular conformation [7–12].
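To make the first category concrete, the following sketch evaluates two typical non-bonded terms for a single atom pair: a 12-6 Lennard-Jones term for the van der Waals interaction and a Coulomb term for electrostatics. It is only an illustration under stated assumptions and not the functional form of any particular cited method; the function names, the parameter values, and the approximate Coulomb constant (about 332 kcal·Å/(mol·e²)) are assumptions. Bending and torsional terms are omitted.

```python
# Approximate Coulomb constant in kcal*angstrom/(mol*e^2); assumed value for illustration.
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy for two atoms separated by distance r (angstroms)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy for partial charges q1 and q2 (in units of e) at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def pair_energy(r, epsilon, sigma, q1, q2, dielectric=1.0):
    """Sum of the van der Waals and electrostatic terms for one atom pair."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2, dielectric)
```

A physics-based score for a conformation would sum such pair terms over all protein-ligand atom pairs, usually together with bonded terms.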
The functions in the second category are knowledge-based. Each of these knowledge-based functions tries to capture some aspect of the
native conformations of protein-ligand complexes, such as the tendency of a certain amino acid to be exposed to or buried from the
solvent and its distance from the interacting groups of the ligand. These knowledge-based functions are compiled from the statistics of a
database of experimentally determined protein-ligand complex structures [13–23].
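A minimal sketch of how such statistics can be turned into a score is given below. It assumes the widely used inverse Boltzmann form, in which the energy of a distance bin is -kT ln(P_obs / P_ref), where P_obs is the observed frequency in the database and P_ref is a reference (prior) frequency. This form, the bin width, the pseudocount, and the function names are assumptions made for illustration; they are not taken from the cited methods.

```python
import math

KT = 0.593  # approximate kT in kcal/mol near 298 K; assumed value


def distance_histogram(distances, bin_width=1.0, max_distance=10.0):
    """Count distances (angstroms) in equal-width bins covering [0, max_distance)."""
    n_bins = int(round(max_distance / bin_width))
    counts = [0] * n_bins
    for d in distances:
        if 0.0 <= d < max_distance:
            index = min(int(d / bin_width), n_bins - 1)
            counts[index] += 1
    return counts


def knowledge_based_potential(observed_counts, reference_counts, pseudocount=1.0):
    """Per-bin energy -kT * ln(P_obs / P_ref) from observed and reference counts.

    A pseudocount is added to every bin so that empty bins do not
    produce a logarithm of zero or a division by zero.
    """
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    potential = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        potential.append(-KT * math.log(p_obs / p_ref))
    return potential
```

Bins that are populated more often in the database than in the reference distribution receive a negative (favorable) energy, and under-populated bins receive a positive one, so the choice of reference distribution directly shapes the resulting score.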
The interplay between these two categories of functions has provided fertile ground for experimenting with and constructing
new scoring functions. Existing distance-based scoring functions treat each atom as a single moiety and hence
miss the effect of its environment. We formulate and analyze an analogous knowledge-based scoring function that involves the
distance of a functional group from triplets of residues in a protein conformation. The functional-group scoring takes into account the
environmental effect of atoms and hence adopts a functional-group singularity concept, in contrast to the atom singularity of the methods
discussed above. We also investigate how various approaches to compiling the prior distribution affect the performance of the
knowledge-based function.
We first briefly review existing knowledge-based scoring function approaches. We then describe the construction of a
knowledge-based scoring function that incorporates the environmental effect of the atom under consideration. Next, we assess the performance of the proposed
knowledge-based function in a protein-ligand binding affinity study. Finally, we propose some possible extensions to the current form of
the scoring function.