Three-dimensional structural information is critical for understanding functional protein properties and the precise mechanisms of protein functions implicated in physiological and pathological processes. Comparison and detection of protein binding sites are key steps for annotating structures with functional predictions and are extremely valuable steps in a drug design process. In this research area, MED-SuMo is a powerful technology to detect and characterize similar local regions on protein surfaces. Each amino acid residue’s potential chemical interactions are represented by specific surface chemical features (SCFs). The MED-SuMo heuristic is based on the representation of binding sites by a graph structure suitable for exploration by an efficient comparison algorithm. We use this approach to analyze one particular SCOP superfamily which includes HSP90 chaperone, MutL/DNA topoisomerase, histidine kinases, and α-ketoacid dehydrogenase kinase C (BCK). They share a common fold and a common region for ATP-binding. To analyze both similar and differing features of this fold, we use a novel classification method, the MED-SuMo multi approach (MED-SMA). We highlight common and distinct features of these proteins. The different clusters created by MED-SMA yield interesting observations. For instance, one cluster gathers three types of proteins (HSP90, topoisomerase VI, and BCK) which all bind the drug radicicol.
Three-dimensional (3D) protein structural information helps to understand functional protein properties and the precise mechanisms of proteins implicated in physiological and pathological processes.1 Knowledge of 3D protein structures bound to small molecules can be used for structure- and ligand-based drug design approaches.2,3 It also gives direct hints about protein functional mechanisms. A protein’s activity often depends on a small, highly conserved set of residues within the binding site.4,5 Comparison and detection of protein binding sites are key steps for annotating structures with functional predictions. In this field, Structural Genomics consortia have radically expanded the base of protein structural knowledge. Their endeavors have permitted the resolution of numerous structures annotated as “Unknown function”, and many functional sites are not associated with any known binding partner.6 Consequently, the development of computational methods to functionally annotate protein structures has become a major research area.
The simplest approaches are based on sequence analogy, eg, PSI-BLAST,7 or on the characterization of functional patterns or profiles, eg, PROSITE.8 They help to draw on knowledge and assumptions of protein functions in assigning predicted functions. However, they cannot embrace the complexity of local 3D folds. During the past years, various methods to compare and detect binding sites have been elaborated; they use diverse types of descriptors. Their general purpose is often to create automated functional annotation methods independent from amino acid sequence or from global fold similarity, eg, CavBase,9 SiteEngine,10 FLAP,11 CPASS,12 or eF-seek.13
Some of these approaches share broad features but they also have notable distinctions. For instance, SiteEngine and CavBase both associate physico-chemical properties with structural characteristics. However, SiteEngine allows the comparison of entire protein surfaces to a binding site database, whereas CavBase is restricted to cavity comparisons. The web-based version of SiteEngine is restricted to the comparison of a single site versus one protein structure.10 CavBase detects related cavities based on a clique detection algorithm,9 while CPASS aligns pairs of binding sites and scores them with a root-mean-square difference (RMSD) scoring function.12 Roterman has developed an innovative methodology based on irregular hydrophobicity distribution.14 A few other methods are based on the detection of conserved residues to characterize binding sites, eg, the evolutionary trace method15–17 or sequence alignment against a dedicated dataset such as the Catalytic Site Atlas (CSA).4
In this research area, SuMo is a powerful technology to localize similar local regions on protein surfaces, ie, binding sites.18 Each chemical property, or interaction, of an amino acid residue is represented by a specific surface chemical feature (SCF). SCFs are grouped in triangles (triplets), each of which constitutes a vertex of the SuMo graph. Since each SCF is associated with heterogeneous geometrical properties, and since triplets have specific superimposition rules (distance, angle), the comparison heuristic is extremely rapid. The comparison of a 3D pattern against all the binding sites of the PDB can be performed in a few minutes.19 MED-SuMo is the latest evolution of the SuMo software developed by MEDIT-SA (see http://www.medit.pharma.com/). Recent developments have improved its binding site database and have added novel functional annotation tools, as presented in a recent study.20
Proteins are also classified according to their folds,21 eg, in SCOP (Structural Classification of Proteins), which provides a manually refined classification with detailed and comprehensive descriptions of the structural and evolutionary relationships of known protein structures.22,23 However, a critical limitation of these fold-based classifications is their reliance on complete protein folds or protein domains. Similarity of fold does not necessarily correspond to similarity of function. In this paper, we focus on an interesting SCOP superfamily which includes the heat shock protein 90 SCOP family (HSP90, see Figure 1).
HSP90 is one of the most abundant proteins. Its different forms exhibit mainly chaperone functions associated with protein folding, cell survival,24 apoptosis and tumor repression.25 It binds ATP (see Figures 2a and 2b) and is the target of some innovative drugs, including geldanamycin, which has achieved a 50% reduction of tumor growth,26 and celastrol, which disrupts interactions between HSP90 and Cdc37 in pancreatic cancer cells.27 Some recent research has focussed on a new potential drug, radicicol. This molecule has a very high affinity for HSP90 (20 nM).28 Figure 3 shows the association of the drug with HSP90 at the binding site normally occupied by a natural ligand.28 However, radicicol is not specific to HSP90, as it also binds the bacterial sensor kinase PhoQ29 and topoisomerase VI.30 Notably, the HSP90 chaperone, MutL/DNA topoisomerase and histidine kinases share a common fold (see Figure 1), and a common ATP-binding region has been detected (see Figures 2c and 2d).
To analyze the similar and different features of this fold, we use a novel classification method, MED-SuMo Multi approach (MED-SMA), based on the MED-SuMo technology. In this work, binding sites from the SCOP superfamily ATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase proteins are gathered in a dataset, compared pairwise and classified using the Markov Cluster Algorithm (MCL).31 Results from this method highlight common and distinct functional features between the analyzed proteins.
The SCOP web site provides the list of proteins associated with a selected fold.23 The “ATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase” superfamily contains 116 PDB structures (see http://scop.berkeley.edu/data/scop.b.e.ccg.A.html). The binding sites of these proteins were selected to perform the classification.
MED-SuMo is designed to localize similar regions associated to a defined function.18–20 A key advantage is its ability to detect binding site similarities even when local flexibility is observed. Its heuristic is based on a 3D representation of macromolecules using precise SCFs. For MED-SuMo, a protein structure is represented by a set of functional groups including, for example, unbound hydrogen bond (Hbond) donors or acceptors, accessible sides of aromatic rings and carboxylate, charges, hydroxyl groups. Each feature encodes its chemical characteristics with precise geometrical properties. The overall MED-SuMo comparison methodology is presented in Figure 4. SCFs are displayed on the protein structure through a lexicographic analysis of the atoms in the PDB files, ie, a residue is represented by a set of representative SCFs (cf. Figures 4a, 4b). Their positions and orientations are filtered as shown in Figure 4c. Remaining SCFs are assembled into triplets with specific geometric characteristics, eg, edge size, perimeter, angles (cf. Figure 4d). The full triplet network is stored in the MED-SuMo database as a graph data structure where triplets are the vertices and edges connect adjacent triangles (ie, those sharing at least two SCFs).
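The triplet graph described above can be illustrated with a minimal Python sketch. It is not the MED-SuMo implementation: the function name `build_triplet_graph`, the representation of an SCF as a bare 3D point, and the default thresholds (taken from the edge_max and max_edge_sum values used later in this study) are assumptions made for illustration, and the real software also applies chemical and orientation filters that are omitted here.

```python
from itertools import combinations
from math import dist


def build_triplet_graph(scfs, edge_max=13.0, max_edge_sum=39.0):
    """Build a toy triplet graph from SCF positions.

    scfs: dict mapping an SCF identifier to its (x, y, z) position in angstroms.
    Returns (triplets, edges): triplets is a list of frozensets of three SCF ids,
    edges is a set of index pairs for triplets sharing at least two SCFs.
    """
    triplets = []
    for a, b, c in combinations(sorted(scfs), 3):
        sides = (
            dist(scfs[a], scfs[b]),
            dist(scfs[b], scfs[c]),
            dist(scfs[a], scfs[c]),
        )
        # Keep the triangle only if every side and the perimeter are small enough.
        if max(sides) <= edge_max and sum(sides) <= max_edge_sum:
            triplets.append(frozenset((a, b, c)))

    # Two triplets are adjacent when they share at least two SCFs.
    edges = {
        (i, j)
        for (i, t), (j, u) in combinations(enumerate(triplets), 2)
        if len(t & u) >= 2
    }
    return triplets, edges
```

For example, four SCFs at the corners of a 3 Å by 4 Å rectangle yield four triplets that are all mutually adjacent, whereas an SCF placed 100 Å away takes part in no triplet.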
To compare graphs, MED-SuMo looks for compatible triplets, ie, triplets composed of compatible SCFs (cf. Figure 4e). These triplets are called comparison “seeds”. When a seed is detected, MED-SuMo extends the comparison to the neighbouring vertices until no more similarities are found. This process builds similar patches (common groups of SCFs) between two graphs, each rated by the MED-SuMo score.18 These comparisons are usually performed between a query and a database of precompiled graphs. Two kinds of MED-SuMo database are commonly used: the binding site database, which is composed of the SCFs around co-crystallized ligands, and the full surface database, which is composed of the SCFs covering the whole surface of each studied protein, typically for the entire PDB. The database characteristics are defined by three essential parameters: the size of the ligand environment taken into account by MED-SuMo (named ligand_radius, which only concerns the binding site database), the maximal distance between two SCFs to be included in a triplet (named edge_max) and the maximal perimeter for a triangle (named max_edge_sum).
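The seed-and-extend idea can be sketched as a breadth-first search over two triplet graphs. This is a simplified illustration, not the MED-SuMo algorithm: the function name `extend_from_seed` is hypothetical, and triplet compatibility is reduced here to equality of a label (for instance a sorted tuple of SCF types), whereas the actual heuristic also checks geometric rules and computes a score.

```python
from collections import deque


def extend_from_seed(labels_a, adj_a, labels_b, adj_b, seed):
    """Grow a set of matched triplet pairs from one seed pair.

    labels_x: dict vertex -> hashable label (assumed compatibility criterion).
    adj_x:    dict vertex -> set of neighbouring vertices in that graph.
    seed:     (vertex_in_a, vertex_in_b), assumed to carry equal labels.
    Returns a dict mapping matched vertices of graph A to vertices of graph B.
    """
    matched = {seed[0]: seed[1]}
    used_b = {seed[1]}
    queue = deque([seed])
    while queue:
        va, vb = queue.popleft()
        # Try to pair each unmatched neighbour in A with a compatible neighbour in B.
        for na in sorted(adj_a[va]):
            if na in matched:
                continue
            for nb in sorted(adj_b[vb]):
                if nb not in used_b and labels_a[na] == labels_b[nb]:
                    matched[na] = nb
                    used_b.add(nb)
                    queue.append((na, nb))
                    break
    return matched
```

The extension stops on its own when no neighbouring pair is compatible, which mirrors the "until no more similarities are found" behaviour described above. The SCFs of the matched triplets would then form a patch.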
As noted, MED-SuMo offers an original approach to detect structural and functional similarities between protein binding sites.18–20 We decided to apply this approach to classify defined sets of structures. This new method, named MED-SuMo Multi Approach (MED-SMA), compares all binding sites from a set of proteins using a pairwise comparison system. Matching regions are found in the binding sites to derive a similarity graph. This graph is classified with MCL.31 Figure 5 illustrates the overall procedure. For this work, MED-SMA is only applied to the MED-SuMo binding site database.
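For readers unfamiliar with MCL, the following pure-Python sketch shows its two alternating steps, expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation), on a small similarity matrix. It is a didactic approximation and not the MCL program used in this work: the function name `mcl`, the inflation value, the fixed iteration count, the self-loop weighting and the cluster read-out threshold are all assumptions.

```python
def mcl(similarity, inflation=2.0, iterations=50, threshold=1e-6):
    """Toy Markov clustering of a symmetric similarity matrix (list of lists).

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(similarity)

    def normalise(mat):
        # Make every column sum to 1 (columns with no weight stay at zero).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] = mat[i][j] / total if total else 0.0
        return mat

    m = [row[:] for row in similarity]
    for j in range(n):
        # Assumed convention: self-loop weight equals the strongest link of the node.
        strongest = max((similarity[i][j] for i in range(n) if i != j), default=0.0)
        m[j][j] = strongest or 1.0
    m = normalise(m)

    for _ in range(iterations):
        # Expansion: random walks of length two.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = normalise([[v ** inflation for v in row] for row in m])

    # Read-out: each non-empty row lists the members attracted to that node.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > threshold)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

In the MED-SMA setting, the nodes would be sub-sites and the matrix entries the MED-SuMo scores between them. Higher inflation values generally produce smaller, tighter clusters.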
To begin, a set of proteins is selected (see previous paragraph, cf. Figure 5a). Ligand characteristics are used to decide which binding sites to include in the MED-SuMo database. Once the ligand parameters are set, the database is created and the pairwise comparison is launched using the standard MED-SuMo comparison procedure.
These comparisons highlight similar regions between pairs of binding sites (cf. Figure 5b) represented by groups of SCFs called patches. Only comparisons with a MED-SuMo score higher than a fixed cut-off (parameter score_min) are accepted. Patches associated to the same binding sites are analyzed: if two patches share enough SCFs (defined by a threshold parameter named covering_factor), they are merged in a multipatch (cf. Figure 5c). A multipatch is a set of SCFs common to several binding sites of the protein set; they can also be called sub-sites. They represent the true meaningful common regions of binding sites. They have two properties: (i) enough SCFs are in common, such that binding sites are structurally and chemically similar, and (ii) they can provide a measure of sub-pocket similarity. These measures are used to compute a similarity matrix. For this matrix, the MED-SuMo score between matching multipatches is calculated (cf. Figure 5d). MCL is used to interpret the matrix through classification of the protein binding site set into clusters of sub-sites (cf. Figure 5e). A 2D plot of the clusters can be visualized using tools such as Biolayout.32,33
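The merging of patches into multipatches can be sketched as follows. The source does not specify how "sharing enough SCFs" is measured, so this example assumes that the overlap is the number of shared SCFs divided by the size of the smaller patch and that two patches are merged when this ratio reaches covering_factor. The function name `merge_patches` is likewise hypothetical.

```python
def merge_patches(patches, covering_factor=0.6):
    """Merge overlapping patches found on one binding site into multipatches.

    patches: iterable of sets of SCF identifiers.
    Two patches are merged when |A & B| / min(|A|, |B|) >= covering_factor
    (an assumed overlap measure). Merging repeats until no pair qualifies.
    """
    multipatches = [set(p) for p in patches if p]
    merged = True
    while merged:
        merged = False
        for i in range(len(multipatches)):
            for j in range(i + 1, len(multipatches)):
                a, b = multipatches[i], multipatches[j]
                if len(a & b) / min(len(a), len(b)) >= covering_factor:
                    # Replace the pair with its union and restart the scan.
                    multipatches[i] = a | b
                    del multipatches[j]
                    merged = True
                    break
            if merged:
                break
    return multipatches
```

With a covering_factor of 0.6, patches {1, 2, 3, 4, 5} and {3, 4, 5, 6} (overlap 3/4) are merged into one multipatch, while {10, 11, 12} and {12, 13, 14, 15} (overlap 1/3) remain separate.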
To generate the MED-SuMo database, only binding sites co-crystallized with ligands with more than ten atoms are selected. Of the originally selected 116 PDB structures, 101 satisfy this filter. This yields a total of 146 binding sites in the final database. Several kinds of ligands are present, purines, eg, adenosine tri-phosphate or N-ethyl-5′-carboxamido adenosine, or potential drugs, eg, Radicicol or Novobiocin. Of these 146 binding sites, 78 are from HSP90, 38 from topoisomerase/MutL, 26 are from histidine kinase, and four are from α-keto-acid dehydrogenase kinase C (BCK). The database parameters are set to a ligand radius of 6.0 Å and triangle parameters of 13 Å and 39 Å (respectively edge_max and max_edge_sum). To classify this dataset, MED-SMA takes around two minutes on a four CPU machine. The classification parameters are set to a minimal compatibility score (score_min) of 4.0 and a covering_factor of 0.6.
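The settings and counts reported above can be gathered in a small Python record, which also makes the arithmetic easy to check. The class and function names are illustrative assumptions and do not correspond to an actual MED-SuMo configuration format; only the numerical values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedSmaSettings:
    """Parameter values reported for this study (field names are illustrative)."""
    min_ligand_atoms_exclusive: int = 10  # ligands must have MORE than ten atoms
    ligand_radius: float = 6.0            # angstroms around the ligand
    edge_max: float = 13.0                # maximal SCF-SCF distance in a triplet
    max_edge_sum: float = 39.0            # maximal triangle perimeter
    score_min: float = 4.0                # minimal MED-SuMo score for a comparison
    covering_factor: float = 0.6          # patch overlap needed for merging


# Binding sites per SCOP family, as reported in the text.
SITES_PER_FAMILY = {
    "HSP90": 78,
    "topoisomerase/MutL": 38,
    "histidine kinase": 26,
    "BCK": 4,
}


def keep_ligand(n_atoms, settings=MedSmaSettings()):
    """A binding site is kept only if its ligand has more than ten atoms."""
    return n_atoms > settings.min_ligand_atoms_exclusive


def accept_comparison(score, settings=MedSmaSettings()):
    """A pairwise comparison is kept only if its score exceeds the cut-off."""
    return score > settings.score_min
```

The per-family counts add up to the 146 binding sites of the final database, and a 31-atom ligand such as ANP passes the ligand filter.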
Here, the MED-SMA approach produces five clusters. The distribution of these clusters with regard to the SCOP families is shown in Table 1, and the composition of each cluster is available in Supplementary data 1.
Two types of MED-SMA clusters are seen. Three clusters are homogeneous as they contain only proteins from a unique SCOP family (MED-SMA clusters 1, 3, and 5). Two clusters are heterogeneous as they contain at least two SCOP families (MED-SMA clusters 2 and 4). MED-SMA clusters 1 and 3 are specific to topoisomerase/MutL while cluster 5 is specific to histidine kinase. MED-SMA cluster 2 contains binding sites from two families (ie, BCK and histidine kinase) and MED-SMA cluster 4’s binding sites are from three of the four families (HSP90, topoisomerase/MutL, and BCK).
MED-SMA clusters 1 and 3 contain 22 and 6, respectively, of the 38 binding sites of the topoisomerase/MutL/DNA gyrase family. The two forms of topoisomerase IV structures of Escherichia coli (PDB codes 1S14 and 1S16) share 99.5% sequence identity except for a 23-residue insertion in 1S16. These two proteins are separated by MED-SMA. A precise look at their ATP-binding sites highlights structural similarities but, above all, some strong distinctions. Figure 6 shows a 3D superimposition of these proteins. The region noted (1) on Figure 6 shows an excellent superimposition of several β-sheets and two α-helices. Moreover, a part of the binding sites is also similar, with a set of five well-superimposed SCFs (marked on Figure 6). Conversely, the other side of the binding site (marked on Figure 6) is quite diverse. The ligands of these two topoisomerases are novobiocin for 1S14 and phosphoaminophosphonic acid-adenylate ester (ANP) for 1S16. They are not located at the same spatial position and their overlap is small (~10 atoms) compared to their respective sizes (44 atoms for novobiocin and 31 atoms for ANP). Furthermore, novobiocin cannot fit in the 1S16 binding site, because it would clash sterically with 1S16’s α-helices (marked on Figure 6). Thus, binding sites from MED-SMA clusters 1 and 3 do not share sufficient similarities to be gathered by MED-SMA, nor can they bind the same kind of molecules. Interestingly, the two forms are very close but the residue insertion causes strongly diverging affinities for ligands of this class.34 Our results therefore reinforce the study of Bellon and colleagues. Moreover, they show that these two distinct local conformations are found in different related proteins.
As mentioned earlier, MED-SMA cluster 4 gathers three different SCOP families. It is the largest cluster, containing 89 binding sites. All HSP90s of the dataset are present (78 binding sites), 10 from mutL/DNA topoisomerase family (with one topoisomerase VI, five MutL, and four PMS2) and one from BCK family. Only the histidine kinase family is not represented in this MED-SMA cluster. The ligands are highly diverse with 48 unique ligands found.
Binding sites in this MED-SMA cluster share a common set of SCFs. Figure 7 shows a global superimposition of one structure from each family. The white rectangles show similarities, whereas the remainder is very different, as represented in the global superimposition of all the protein families in Figure 1. Figure 8 shows a close-up view around radicicol. The eight labelled SCFs (circled in yellow) are shared by all superimposed structures in Figure 7. They are located all around the ligand, meaning that the similarities concern the whole binding site.
The fact that MED-SMA gathers the binding sites from three different SCOP families implies a high probability that the binding modes are related. Considering that the nonspecific drug radicicol binds HSP90 and topoisomerase VI,30 we can reasonably hypothesize that this drug would also bind the other proteins included in this MED-SMA cluster.
MED-SMA clusters 2 and 5 mostly consist of histidine kinases. MED-SMA cluster 2 is heterogeneous while MED-SMA cluster 5 is homogeneous. Cluster 5 is particularly valuable because it is pure and because the dimensions of its binding sites are very similar, as they all bind purine ligands. Since the binding sites gathered by MED-SMA share binding modes to ligands, this type of cluster could be used to search for specific drugs; here, drugs to inhibit the action of histidine kinase CheA.
Interestingly, MED-SMA cluster 2 also contains two histidine kinase CheA (PDB codes 2CH4 and 1I5D). The separation of proteins from the same family in two different clusters is due to differences between their binding sites. When 1I5D’s binding site is compared to histidine kinase CheA from cluster 5, the MED-SuMo score is less than 4.0 (which is the cut-off we chose for the pairwise comparison step). So, a drug designed to inhibit binding sites of cluster 5 would not bind (or not with the same affinity) the two excluded histidine kinase CheA binding sites.
Another interesting point about MED-SMA cluster 2 is that it contains both BCK and the anti-sigma factor spoIIab. Like HSP90, these two proteins are inhibited by radicicol. However, as they are not associated with MED-SMA cluster 4, this may reflect a specific binding mode.
The detection of functional sites on protein surfaces is important for the identification of biological activity. Ligand-protein interactions occur for the majority of protein structures and they are implicated in major biological processes. However, with no help from known related sequences or structures, their detection is difficult.14 Several innovative approaches have been proposed, eg, the use of hydrophobicity distribution on protein structures based on the fuzzy oil drop model,35 the destabilization of limited protein regions,36 phylogenomic classification of protein sequences37 or the classification of known protein catalytic sites.38 Prediction of protein functional sites is an important step to identify small-molecule interactions for drug discovery39 and it can be very useful to optimize drug design.40 Another valuable application is as a pre-processing step to reduce the search space for rigorous computational docking algorithms.
Methods to compare binding sites have been developed using various kinds of structural descriptors, eg, CavBase uses pseudocenters,41 and the strong hypothesis that chemical similarity and activity are linked. In this field, MED-SuMo has an interesting approach using SCFs. Each SCF represents a pertinent chemical property and is described with specific geometric rules. The search for equivalent binding sites is performed by detection of similar graphs.42 The specific geometric rules of each SCF enable the heuristic to be quite fast. So, MED-SuMo provides an interesting and original method to detect structural and functional similarities between protein binding sites. Unlike MED-SuMo, very few methods enable functional classification of sets of binding sites43 and specific binding sites are usually chosen (protein kinase) for the published work. Comparing our protocol with others is quite difficult.
Here, MED-SuMo is applied in a new clustering approach where the ligand environment is classified. We describe an application to a particular protein fold, the Bergerat ATP-binding fold, characterized as the ATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase SCOP superfamily. The constituent families are quite different but their ATP binding sites appear quite alike. MED-SMA detects five different clusters. Three out of five are specific to a single family. These three MED-SMA clusters highlight the specificity of the binding sites; for example, a molecule binding to a cluster 1 binding site would not bind MED-SMA cluster 2 sites with the same interactions. The fact that the ligands are similar in MED-SMA clusters 1 and 2 (eg, ADP) emphasizes the previous observation: the ligands are the same whereas the binding modes are different. Conversely, MED-SMA cluster 4 gathers three different families. The 3D superimposition from MED-SuMo points out the difference of the global fold, whereas the Bergerat fold can be observed (white rectangle on Figure 7). Interestingly, SCFs can be found all around the query ligand (cf. Figure 7), meaning that there is a global similarity of the binding sites from the three SCOP families. Moreover, this result is consistent with the experimental data, as the proteins from these three SCOP families all bind radicicol.28–30,44
These different results demonstrate the ability of the method to gather binding sites with related binding modes. This kind of relationship between families is very interesting and their identification is a direct application for MED-SMA. Moreover, with this kind of association, we can validate the assertion that functions can be assigned to unknown proteins by associating them to a specific best matching cluster. Matching clusters rather than single structures overcomes most of the noise in both the assignments and in the functions of those assigned matches. Other applications are planned, for example, a more general kinase classification using MED-SMA is under investigation.
This example shows that our approach is well suited for finding common and distinct characteristics of ligand binding pockets. Thus, close proteins can have different local binding modes, while more distant ones can share common binding features, ie, a potential cross-reaction may be possible. For instance, proteins associated with radicicol are found in the same MED-SMA clusters. This approach is clearly applicable to structural genomics research. As noted by Ferrè and colleagues, functional patches associated with a large collection of protein surface cavities can be used to provide functional clues for proteins with unknown structures.45 Our study supports this observation. Thus, MED-SuMo is an approach that may improve the efficiency and effectiveness of early steps along the drug discovery path, improving early lead choices, enhancing poor leads, or aiding multivariate optimizations. This study further demonstrates that MED-SuMo is appropriate both for annotating protein structures and for deriving structural functional classifications.
Finally, with its effectiveness at dealing with the entire PDB, and with the parallelisation of the computational process in progress, MED-SuMo is well suited to large-scale applications. It is currently used to address the major challenge of the POPS project (see http://www.pops-systematic.org/), namely classifying every binding site represented in the PDB.
Commercial information regarding MED-SuMo is available at http://www.medit.fr/. Questions about MED-SuMo licensing should be addressed to info@medit.fr. The researcher from the Inserm unit UMR-S 726 has no financial interests in MEDIT and collaborates with this company only for the present project. MEDIT SA has exclusive rights to MED-SuMo sales.
This work was supported by French Institute for Health and Medical Care (INSERM) and University Denis Diderot Paris 7. ODA’s PhD is financed by the French technical research association (ANRT) through a CIFRE grant. MEDIT holds all the rights on the presented methodology. The authors are indebted to S. Adcock for useful comments on the manuscript.