Bioinformation. 2009; 3(6): 268–274.
Published online 2009 January 12.
PMCID: PMC2646862

Functional group based Ligand binding affinity scoring function at atomic environmental level


The use of knowledge-based scoring functions (KBSF) for virtual screening and molecular docking has become an established method in drug discovery. In the absence of a precise and reliable free energy function that describes the various interactions, including water-mediated atomic interactions between amino acid residues and the ligand, distance-based statistical measures remain the only alternative. Until now, all distance-based scoring functions in the KBSF arena have used the atom singularity concept, which neglects the environment of the atom under consideration. We have developed a novel knowledge-based statistical energy function for protein-ligand complexes that takes the atomic environment into account and hence treats the functional group as a singular entity. The proposed knowledge-based scoring function is fast, simple to construct and easy to use; moreover, it tackles the existing problem of handling molecular orientation in the active site pocket. We have designed and used a Functional group based Ligand retrieval (FBLR) system which can identify functional groups in a ligand and detect their orientation. This decoy searching was used to build the above KBSF to quantify the activity and affinity of high resolution protein-ligand complexes. We propose the probable use of these decoys in molecular build-up as a de novo drug design approach. We also discuss the possible use of the said KBSF in pharmacophore fragment detection and in a pseudo center based fragment alignment procedure.

Keywords: ontology, small molecule, functional group, pharmacophore, semantic similarity, scoring function, knowledgebase, Distance based scoring


The number of publicly available protein structures in the Research Collaboratory for Structural Bioinformatics database has grown to more than 30,000, with thousands added each year. In addition, the number of small molecule structures available in public and proprietary databases has reached into the millions. This wealth of data raises the question of how it can best be used to assist drug design and discovery. Moreover, finding novel leads for a new drug target has become one of the most crucial steps in a drug development program. Researchers today mostly follow two complementary strategies: 1) experimental high-throughput screening to discover possible leads from large compound libraries, and 2) computational methods that exploit structural information about the protein binding site to discover new leads by virtual screening of large databases [1,2,3,4]. Virtual screening approaches try to predict the actual binding mode of a ligand at the binding site by scoring each possible binding mode through docking. In silico virtual screening is useful because it is fast enough to scan several hundred to several thousand compounds [5,6]. Through virtual screening we can rank the possible modes of ligand binding and can also predict the Gibbs free energy of binding, provided that the structure of the receptor is known and the scoring function is good enough. The performance of such methods is usually assessed by whether they reproduce the binding geometry of protein-ligand complexes resolved by X-ray crystallography or NMR. This validation criterion imposes some preconditions on the methods being developed, because only protein-ligand complexes of limited resolution are available.

There are two broad categories of scoring functions. Functions in the first category are largely based on aspects of the known physics of molecular interaction, such as van der Waals forces, electrostatics, and bending and torsional forces, to determine the energy of a particular conformation [7–12]. Functions in the second category are knowledge-based. Each of these tries to capture some aspect of the native conformations of protein-ligand complexes, such as the tendency of a certain amino acid to be exposed to or buried from the solvent and its distance from interacting groups of the ligand. These knowledge-based functions are compiled from the statistics of a database of experimentally determined protein-ligand complex structures [13–23].

Interaction between these two categories of functions has resulted in a fertile ground for the experimentation and construction of new scoring functions. The distance-based scoring functions in use today treat each atom as a single moiety and hence miss the effect of its environment. We formulate and analyze an analogous knowledge-based scoring function which involves the distance of a functional group from triplets of residues in a protein conformation. The functional group scoring takes the environmental effect of atoms into account and hence applies a functional group singularity concept, unlike the atom singularity of the above methods. We also investigate how various approaches for compiling the prior distribution affect the performance of the knowledge-based function.

We first briefly review the existing knowledge-based scoring function approaches. We then describe the construction of a knowledge-based scoring function which incorporates the environmental effect of the atom under consideration. Next, we examine the performance of the proposed knowledge-based function in a protein-ligand binding affinity study. Finally, we propose some possible extensions to the current form of the scoring function.

Theoretical background and Methodology adopted

Existing methodologies

Knowledge-based potentials have been applied successfully to rank different solutions of the protein-folding problem [24–26]. This approach has also been applied in several case studies to rank different protein-ligand complexes. None of these, however, considered the environment of the ligand atoms from which distances to the residue triplets of the protein were calculated. Wallqvist and co-workers [27,28] classified the surfaces of buried ligand atoms found in 38 complexes to develop a model for predicting the Gibbs free energy of binding from these observed atom-atom preferences. Analyzing ten HIV protease inhibitor complexes, they approximated the free energy of binding to an accuracy of ± 1.5 kcal/mol. Verkhivker and co-workers [29], using a data set of 30 HIV-1, HIV-2, and SIV proteases, compiled a distance-dependent knowledge-based pair potential which was then combined with hydrophobicity [30] and conformational entropy scales [31] that had originally been developed to explain protein folding and stability. Muegge and Martin [32] explored the structural information of known protein-ligand complexes from the PDB and derived distance-dependent Helmholtz free interaction energies of protein-ligand atom pairs. Tested on 77 protein-ligand complexes, the calculated score displayed a standard deviation from the observed binding affinities of 1.8 log Ki units. The scoring function was further evaluated by docking weak-binding ligands to the FK506-binding protein [33]. Similarly, DeWitte and Shakhnovich [34] used a sample of 126 structures from the PDB to develop a set of inter-atomic interaction free energies for a variety of atom types. Gohlke et al. [35] studied 91 protein-ligand PDB complexes to derive a new scoring function that discriminates and predicts ligand-binding modes among multiple solutions.
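The distance-dependent potentials reviewed above share one general recipe: observed pair distances are binned, and each bin's frequency is converted to an energy-like score by an inverse Boltzmann relation. The following Python sketch illustrates that recipe only. The function name, the uniform-density reference state and the pseudocount are assumptions made for illustration; each cited work defines its own reference state. The default limits of 1 Å and 12 Å are the r-min and r-max values reported later in this paper.

```python
import math


def distance_potential(distances, r_min=1.0, r_max=12.0, bin_width=1.0, pseudo=1e-6):
    """Turn observed pair distances into a binned statistical potential.

    Returns a list of (bin_start, energy) pairs, with energy in units of kT.
    Assumption: the reference state is a uniform density, so the expected
    share of each bin is proportional to the volume of its spherical shell.
    """
    n_bins = int(round((r_max - r_min) / bin_width))
    counts = [0] * n_bins
    for r in distances:
        if r_min <= r < r_max:
            index = min(int((r - r_min) / bin_width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no distances fall inside [r_min, r_max)")

    # Shell volumes (the common 4/3*pi factor cancels in the ratio below).
    shells = [
        (r_min + (i + 1) * bin_width) ** 3 - (r_min + i * bin_width) ** 3
        for i in range(n_bins)
    ]
    shell_total = sum(shells)

    potential = []
    for i in range(n_bins):
        observed = counts[i] / total
        expected = shells[i] / shell_total
        # Inverse Boltzmann: frequent distances get low (favourable) energies.
        # The pseudocount keeps empty bins finite (an illustrative choice).
        energy = -math.log((observed + pseudo) / expected)
        potential.append((r_min + i * bin_width, energy))
    return potential
```

A bin that is populated more often than the reference state predicts receives a negative score, and an empty bin receives a large positive one.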

Limitation of existing methodologies and our approach

All of the existing approaches discussed above calculate the distances between the ligand atoms under consideration and the functional moieties within the target active site. The functional moiety of the active site can be a whole residue triplet or only an individual atom of an interacting amino acid (Figure 1). In all cases, distances were calculated over a series of high resolution protein-ligand complexes to arrive at a consensus distance score, each with a measured standard deviation.

Figure 1
Atom Singularity concept: each atom treated as individual entity irrespective of the functional groups

However, one basic problem with the scoring functions discussed above is that, although they account statistically for all distance measures, they neglect the atomic environment. Figure 2 makes it evident that the oxygen in the acid (A) and ketone (B) groups will have different electron cloud distributions. The two should therefore be treated separately, unlike in the atom singularity concept (a single distance measure) of earlier distance-based scoring functions. We have taken this into account in our study to design a scoring function that is functional group based. Atomic environmental effects are thus considered in our functional group singularity concept.

Figure 2
Oxygen in different environment, (A) carboxylic acid and (B) ketone

Identification of functional group

We have designed a tool which automatically detects and assigns functional group (FG) information to any given small molecule [36]. Theoretically it can identify any functional group, though only 210 different functional groups have been annotated from our ligand dataset. Assignment of an FG is based strictly on computational detection of a specific arrangement of atoms and bonds within the input molecular structure. A given ligand structure may have any number of FGs assigned to it, and each FG is detected together with its specific coordinate information.
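As an illustration of detecting a functional group from a specific arrangement of atoms and bonds, the sketch below separates the two carbonyl environments compared in this paper. It is not the tool described in [36]. The function name, the input layout (element symbols plus a bond list) and the heavy-atom-only assumption are all hypothetical choices made for the example.

```python
def classify_carbonyls(atoms, bonds):
    """Label each carbonyl carbon by the atoms surrounding it.

    atoms: list of element symbols, heavy atoms only (assumption: no explicit H).
    bonds: list of (i, j, order) tuples with 0-based atom indices.
    Returns {carbon_index: label}.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    labels = {}
    for c, element in enumerate(atoms):
        if element != "C":
            continue
        double_o = [n for n, order in neighbours[c] if atoms[n] == "O" and order == 2]
        if len(double_o) != 1:
            continue  # not a carbonyl carbon
        others = [(n, order) for n, order in neighbours[c] if n != double_o[0]]
        single_o = [n for n, order in others if atoms[n] == "O" and order == 1]
        # A hydroxyl-like oxygen has no heavy-atom neighbour except the carbon.
        terminal_o = [n for n in single_o if len(neighbours[n]) == 1]
        if terminal_o:
            labels[c] = "carboxylic acid"
        elif len(others) == 2 and all(atoms[n] == "C" for n, _ in others):
            labels[c] = "ketone"
        else:
            labels[c] = "other carbonyl"
    return labels
```

For example, `classify_carbonyls(["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])` labels atom 1 of acetic acid as a carboxylic acid carbon, while the same bond pattern with a carbon in place of the last oxygen (acetone) is labelled a ketone.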

Ligand files preparation

To make our functional group study more meaningful and the ontology design more application oriented, we selected a set of high resolution (better than 2 Å) protein-ligand complexes from the Protein Data Bank (PDB) as the input set for our study. Each PDB file in the input set holds coordinate information for the protein and the ligand bound to it. Because these PDB files carry no information about atomic connections, we converted them to MOL format. A MOL file holds atomic as well as connection information, with all relevant information of the PDB file retained. We specifically extracted the heteroatom information from the PDB file (records with row identifier HETATM) and the corresponding connection information from the MOL file.
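The HETATM extraction step can be sketched in Python as follows, using the fixed column layout of the PDB format. The function name and the returned dictionary keys are hypothetical. The residue names treated as water (`HOH`, `WAT`) are an assumption, included because the water coordinates are discarded in the next processing step.

```python
def read_ligand_atoms(pdb_text, skip_residues=("HOH", "WAT")):
    """Collect HETATM records from PDB text, leaving out water molecules.

    Fields are sliced by the fixed PDB columns: serial 7-11, atom name 13-16,
    residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue  # ATOM, TER, CONECT and other records are ignored here
        res_name = line[17:20].strip()
        if res_name in skip_residues:
            continue  # water is not part of the ligand
        atoms.append({
            "serial": int(line[6:11]),
            "name": line[12:16].strip(),
            "res_name": res_name,
            "chain": line[21],
            "res_seq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms
```

Slicing by column rather than splitting on whitespace matters here, because adjacent PDB fields can touch each other without a separating space.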

Coordinate information of water molecules was explicitly removed and only small molecule coordinate information was retained. An adjacency matrix was developed from the above flat file information and further processed for functional group identification. Link finding was done from the matrix data with priority given to the highest bond order. The atom information corresponding to each bond order was then used to categorize each probable functional group in the given ligand. This information was mapped to a chemical tree for functional group identification and annotation in descending order of bond order. Annotation keys were developed and used in the FGO to identify and represent each functional group along with its coordinates and other information.
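A minimal sketch of the adjacency matrix construction and the bond order priority is given below, assuming the MOL file is in the fixed-width V2000 layout. The function name and return values are hypothetical. Sorting by the raw MOL bond type is also an assumption: an aromatic bond (type 4) would sort ahead of a triple bond, which may not match the authors' priority scheme.

```python
def parse_molblock(molblock):
    """Read a V2000 MOL block into atoms, an adjacency matrix and sorted bonds.

    Returns (atoms, adjacency, bonds): atoms is a list of element symbols,
    adjacency[i][j] holds the bond order between atoms i and j (0 if none),
    and bonds is a list of (i, j, order) sorted by descending bond order.
    """
    lines = molblock.splitlines()
    counts = lines[3]  # line 4 holds the atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    # Atom block: x, y, z in three 10-character fields, then the symbol.
    atoms = [line[31:34].strip() for line in lines[4:4 + n_atoms]]

    adjacency = [[0] * n_atoms for _ in range(n_atoms)]
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Bond block: first atom, second atom, bond type (1-based atom numbers).
        i, j, order = int(line[0:3]) - 1, int(line[3:6]) - 1, int(line[6:9])
        adjacency[i][j] = adjacency[j][i] = order
        bonds.append((i, j, order))

    # Highest bond order first, so multiple bonds are examined before single ones.
    bonds.sort(key=lambda bond: bond[2], reverse=True)
    return atoms, adjacency, bonds
```

Visiting the bonds in this order means that a C=O double bond is encountered before the single bonds around it, which is a natural starting point for recognising carbonyl-containing groups.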

Functional group ontology

The Functional group ontology (FGO) is designed to treat each functional group as an abstract super class with each of its atoms as an object [36]. The detailed information of each atom and its connection information with parent atoms were stored in a single class. Such classes were hierarchically arranged to denote each functional group. Refer to the supplementary material for parameter definitions.
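One way to picture this class arrangement is the Python sketch below. It is an illustration under stated assumptions, not the FGO schema itself: the class names `AtomNode`, `FunctionalGroup` and `Ketone` and all field names are hypothetical, and the actual parameters are defined in the supplementary material.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AtomNode:
    """One atom, with its coordinates and its link to a parent atom."""
    element: str
    xyz: Tuple[float, float, float]
    # parent is excluded from repr and comparison to avoid parent/child cycles
    parent: Optional["AtomNode"] = field(default=None, repr=False, compare=False)
    bond_order_to_parent: int = 0
    children: List["AtomNode"] = field(default_factory=list)

    def attach(self, child: "AtomNode", bond_order: int) -> "AtomNode":
        """Hang a child atom under this atom and record the connecting bond."""
        child.parent = self
        child.bond_order_to_parent = bond_order
        self.children.append(child)
        return child


@dataclass
class FunctionalGroup:
    """Super class: a named group defined by a hierarchy of atom objects."""
    name: str
    root: AtomNode

    def atoms(self) -> List[AtomNode]:
        """Return every atom of the group in parent-before-child order."""
        ordered, stack = [], [self.root]
        while stack:
            node = stack.pop()
            ordered.append(node)
            stack.extend(reversed(node.children))
        return ordered


class Ketone(FunctionalGroup):
    """Example of a concrete group derived from the super class."""
```

Because each atom object carries its own coordinates, a group built this way keeps the orientation information needed for the distance measurements.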


In total, 250 protein structures were studied to estimate ΔP for each set of available functional groups. The distance between the CG of the functional group and the interacting functional point of the active site triad was measured within a cutoff of 12 Å. Here we make a comparative study of the carbonyl oxygen in the ketone and carboxylic acid functional groups. Of the 250 selected high resolution structures, only 169 were found to contain one or more of these functional groups. The distance radii for the statistical potential function were calculated for these groups, and the frequency distribution plots for the two functional groups are shown in Figures 5 and 6, respectively.

Figure 5
Distance frequency distribution for ketone functional group in protein active site
Figure 6
Distance frequency distribution for carboxylic functional group in protein active site
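The distance frequency distributions of Figures 5 and 6 can be approximated with a short Python sketch. Several assumptions are made here: "CG" is taken to mean the unweighted geometric centre of the group's atoms, the active site interaction points are supplied as plain coordinates, and the 0.5 Å bin width is an arbitrary illustrative choice. Only the 12 Å cutoff comes from the text.

```python
import math
from collections import Counter


def centroid(coords):
    """Unweighted geometric centre of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(point[k] for point in coords) / n for k in range(3))


def distance_histogram(group_coord_sets, site_points, cutoff=12.0, bin_width=0.5):
    """Count centroid-to-site distances per bin, ignoring those beyond the cutoff.

    group_coord_sets: one coordinate list per functional group occurrence.
    site_points: coordinates of active site interaction points (hypothetical input).
    Returns {bin_start: count} ordered by increasing distance.
    """
    histogram = Counter()
    for coords in group_coord_sets:
        centre = centroid(coords)
        for point in site_points:
            d = math.dist(centre, point)  # requires Python 3.8 or later
            if d <= cutoff:
                histogram[int(d // bin_width) * bin_width] += 1
    return dict(sorted(histogram.items()))
```

Running the function separately on the ketone and the carboxylic acid occurrences gives two histograms that can be compared bin by bin.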

From Figures 5 and 6 it is clearly evident that, although both functional groups under study contain the carbonyl moiety, the mean distance 'r' differs between them because of the difference in environment. The carboxylic acid group has distance radii between 2.75 Å and 2.95 Å, whereas the ketone group falls between 2.35 Å and 2.64 Å. This suggests that a stronger protein-ligand interaction prevails with the ketone group than with the acid group in the protein active site pocket. A further explanation is that the carboxylic acid group has an additional oxygen atom adjacent to the carbonyl group; the strong electronegativity of this oxygen makes the electron cloud over the carbonyl group less dense, so that it forms a weaker interaction with the responsible amino acid residue in the active site pocket. The absence of an electronegative atom adjacent to the ketone functional group, in contrast, lets it interact with the active site more strongly than the carboxylate does.

Because this work records the functional group fragments responsible for binding, it can be used further in de novo drug design, where a molecular build-up approach generates new leads for a receptor. The orientation profile of every functional group is also represented in the ontology, and can therefore be used to place these building blocks explicitly in the receptor cavity when constructing the final lead. The semantic similarity measure can be used for pharmacophore matching and database searching algorithms, as well as for structural pharmacophore clustering of ligand molecules. Diversity management in virtual screening can also be addressed by using the semantic similarity score.


In this study, we have constructed and analyzed a functional group based, distance-dependent, knowledge-based scoring function. The scoring function is inspired by the previous work of Gohlke et al. [35], who designed an atomic distance based scoring function to discriminate and predict ligand-binding modes among multiple solutions. Our functional group ligand retrieval system can retrieve ligand functional group fragments and represent them by a uniquely designed ontology. This ontology also holds exact coordinate information to track the specific orientation of the ligand binding mode. We have used the ligand functional groups to create a pseudo centre distance based scoring function which can be used as a distance-dependent potential function of protein-ligand binding. In our numerical distance approximation experiments we used an r-min of 1 Å and an r-max of 12 Å to accommodate the orientation of bulky functional groups and to produce good results. The function takes the atomic environment into consideration and hence incorporates the ligand electron cloud distribution into the scoring function more accurately. Our comparative assessment of the carbonyl oxygen of ketone and acid groups supports the view that environmental effects make the two sets of carbonyl atoms quite distinct from each other. This statistical distance potential can be used for measuring ligand-protein binding affinity.

Figure 3
Flow chart of FGO generation
Figure 4
Class architecture of single atom used in FGO

Supplementary material

Data 1:


1. Kubinyi H. Curr Opin Drug Discov Devel. 1998;1:4.
2. Muller K. Perspectives in Drug Discovery and Design. 1995;3:234.1.
3. Van Drie JH, Lajiness MS. Drug Discov Today. 1998;3:274.
4. Walters WP, et al. Drug Discov Today. 1998;3:160.
5. Kuntz ID, et al. J Mol Biol. 1982;161:269.
6. Lengauer T, Rarey M. Curr Opin Struct Biol. 1996;6:402.
7. Brooks B, et al. J Comput Chem. 1983;4:187.
8. Jorgensen W, Tirado-Rives J. J Am Chem Soc. 1988;110:1657.
9. Nemethy G, et al. J Phys Chem. 1992;96:6472.
10. Cornell WD, et al. J Am Chem Soc. 1995;117:5179.
11. Weiner S, et al. J Comput Chem. 1986;7:230.
12. MacKerell AD, et al. J Phys Chem. 1998;102:3586.
13. Wodak S, Rooman M. Curr Opin Struct Biol. 1993;3:247.
14. Gilis D, Rooman M. J Mol Biol. 1996;257:1112.
15. Moult J, et al. Proteins. 1997;29:2.
16. Moult J, et al. Proteins. 1999;37:2.
17. Moult J, et al. Proteins. 2001;45:2.
18. Moult J, et al. Proteins. 2003;53:334.
19. Samudrala R, Levitt M. BMC Struct Biol. 2002;2:3.
20. Samudrala R, Moult J. J Mol Biol. 1998;275:895.
21. Samudrala R, et al. Proteins. 1999;3:194.
22. Dunker K, et al. Proceedings of the Pacific Symposium on Bio-computing, World Scientific Press; Singapore, 505.
23. Sippl MJ. Curr Opin Struct Biol. 1995;5:229.
24. Jernigan RL, Bahar I. Curr Opin Struct Biol. 1996;6:195.
25. Vajda S, et al. Curr Opin Struct Biol. 1997;7:222.
26. Torda AE. Curr Opin Struct Biol. 1997;7:200.
27. Wallqvist A, Covell DG. Proteins: Struct Funct Genet. 1996;25:403.
28. Wallqvist A, et al. Protein Sci. 1995;4:1881.
29. Verkhivker G, et al. Protein Eng. 1995;8:677.
30. Sharp KA, et al. Biochemistry. 1991;30:9686.
31. Pickett SD, Sternberg MJ. J Mol Biol. 1993;231:825.
32. Muegge I, et al. J Med Chem. 1999;42:2498.
33. Muegge I, Martin YC. J Med Chem. 1999;42:791.
34. DeWitte RS, Shakhnovich EI. J Am Chem Soc. 1996;118:11733.
35. Gohlke H, et al. J Mol Biol. 2000;295:337.
36. Kumar VP, et al. Bioinformation. 2007;2:113.
37. Berman HM, et al. Nucleic Acids Research. 2000;28:235.

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group.