For simplicity of exposition, we begin by comparing ConCavity
's performance to a representative structural method and a representative conservation method. We use Ligsite+
as the representative structure-based method, and refer to it as “Structure
is our implementation (as indicated by superscript “+”) of a popular geometry based surface pocket identification algorithm. We demonstrate in the Methods
section that Ligsite+
provides a fair representation of these methods. We choose Jensen-Shannon divergence
) to represent conservation methods and refer to it as “Conservation
has been previously shown to provide state-of-the-art performance in identifying catalytic sites and ligand binding sites 
. We have developed three versions of ConCavity
that integrate evolutionary conservation into different surface pocket prediction algorithms (Ligsite 
, Surfnet 
, or PocketFinder 
). When the underlying algorithm is relevant, we refer to these versions as ConCavityL
, and ConCavityP
. However, for simplicity, we will use ConCavityL
as representative of these approaches and call it “ConCavity
produce predictions of ligand binding pockets and residues. The pocket predictions are given as non-zero values on a regular 3D grid that surrounds the protein; the score associated with each grid point represents an estimated likelihood that it overlaps a bound ligand atom. Similarly, each residue in the protein sequence is assigned a score that represents its likelihood of contacting a bound ligand. Conservation
only makes residue-level predictions, because it does not consider protein structure. All methods are evaluated on 332 proteins from the non-redundant LigASite 7.0 dataset 
. To evaluate pocket identification performance, we predict ligand locations on the the holo version of the dataset, in order to use the bound ligands' locations as positives. When evaluating residue predictions, we predict ligand binding residues on the apo structures, and the residues annotated as ligand binding (as derived from the holo structures) are used as positives.
We quantify the overall performance of each method's predictions in two ways. First, for both pocket and residue prediction, we generate precision-recall (PR) curves that reflect the ability of each method's grid and residue scores to identify ligand atoms and ligand binding residues, respectively. (Just as residues are assigned a range of ligand binding scores, grid points in predicted pockets get a range of scores, since there may be more evidence that a ligand is bound in one part of a pocket than another.) Second, for each set of predicted pockets (corresponding to groups of non-zero values in the 3D grid), we consider how well they overlap known ligands via the Jaccard coefficient. The Jaccard coefficient captures the tradeoff between precision and recall by taking the ratio of the intersection of the predicted pocket and the actual ligand over their union. The Jaccard coefficient ranges between zero and one, and a high value implies that the prediction covers the ligand well and has a similar volume. We assess the significance of the difference in performance of methods on the dataset with respect to a given statistic via the Wilcoxon rank-sum test.
Integrating evolutionary sequence conservation and structure-based pocket finding to predict ligand-binding sites improves on either approach alone
with its constituent structure and conservation based components. shows that, within predicted pockets, grid points with higher scores are more likely to overlap the ligand, and that the significant improvement of ConCavity
(p<2.2e−16) exists across the range of score thresholds. demonstrates that the superior performance of ConCavity
holds when predicting ligand binding residues as well (p
's ability to identify ligand binding residues is striking: across this diverse dataset, the first residue prediction of ConCavity
will be in contact with a ligand in nearly 80% of proteins. ConCavity
also maintains high precision across the full recall range: precision of 65% at 50% recall and better than 30% when all ligand-binding residues have been identified. As mentioned above, this large improvement exists when predicting ligand locations as well; however, the PR curves illustrate that fully identifying a ligand's position is more difficult for each of the methods than finding all contacting residues.
Ligand binding site prediction performance.
The ligand overlap statistics presented in also demonstrate the superior performance of ConCavity. In nearly 95% of structures, ConCavity's predictions overlap with a bound ligand. Structure's predictions overlap ligands in nearly 92% of the proteins considered. The differences between the methods become more stark when we examine the magnitude of these overlaps. Both ConCavity and Structure predict pockets with total volume (Prediction Vol.) similar to that of all relevant ligands (Ligand Vol.), but ConCavity's pockets overlap a larger fraction of the ligand volume. Thus ConCavity has a significantly higher Jaccard coefficient (p<2.2e−16). This suggests that the integration of sequence conservation with structural pocket identification results in more accurate pockets than when using structural features alone.
The overlap between predicted pockets and bound ligands in holo protein structures from the LigASite database.
also provides a direct comparison of ligand binding site prediction methods based on sequence conservation with those based on structural features. Structure outperforms Conservation, a state-of-the-art method for estimating sequence conservation. Protein residues can be evolutionarily conserved for a number of reasons, so it is not surprising that Conservation identifies many non-ligand-binding residues, and thus, does not perform as well as Structure.
ConCavity's improvement comes from integrating complementary information from evolutionary sequence conservation and structure-based pocket identification
and present pocket and residue predictions of Conservation, Structure, and ConCavity on three example proteins. In general, different types of positions are predicted by Conservation and Structure. If we consider the number of known ligand binding residues for each protein in the dataset, and take this number of top predictions for the Structure and Conservation methods, the overlap is only 26%. The residues predicted by sequence conservation are spread throughout the protein (); ligand-binding residues are often very conserved, but many other positions are highly conserved as well due to other functional constraints. In contrast, the structure-based predictions are strongly clustered around surface pockets (, left column); many of these residues near pockets are not evolutionarily conserved. However, these features provide largely complementary information about importance for ligand binding. Over the entire dataset, 68% of residues predicted by both Conservation and Structure are in contact with ligands, while only 16% and 43% of those predicted by only conservation or structure respectively are ligand binding. ConCavity takes advantage of this complementarity to achieve its dramatic improvement; it gives high scores to positions that show evidence of both being in a well-formed pocket and being evolutionarily conserved.
Evolutionary sequence conservation mapped to the surface of three example proteins.
Comparison of the binding site predictions of Structure and ConCavity on three example proteins.
The examples of and illustrate this and highlight several common patterns in ConCavity's improved predictions. For 3CWK, a cellular retinoic acid-binding protein, Structure and ConCavity's residue predictions center on the main ligand binding pocket (), while Conservation gives high scores to some positions in the binding site, but also to some unrelated residues (). Looking at the ligand location predictions (green meshes in ), Structure and ConCavity both find the pocket, but the signal from conservation enables ConCavity to more accurately trace the ligand's location. This illustrates how the pattern of functional conservation observed at the protein surface influences the shape of the predicted pocket. Ligands often do not completely fill surface pockets; if the contacting residues are conserved, our approach can suggest a more accurate shape.
The results for 2CWH () and 1G6C () demonstrate that ConCavity
can predict dramatically different sets of pockets than are obtained when considering structure alone. In 2CWH, both methods identify the ligands, but Structure
over-predicts the bottom left binding pocket and predicts an additional pocket that does not have a ligand bound. ConCavity
traces the ligands more closely and does not predict any additional pockets. Structure
performs quite poorly on the tetramer 1G6C: it predicts several pockets that do not bind ligands; it fails to completely identify several ligands; and it misses one ligand entirely. In stark contrast, ConCavity
's four predicted pockets each accurately trace a ligand. The incorporation of conservation resulted in the accurate prediction of a pocket in a region where no pocket was predicted using structure alone. Images of predictions for all methods on all proteins in the dataset are available in the Text S1
file, and ConCavity
's predictions for all structures in the Protein Quaternary Structure (PQS) database are available online.
ConCavity significantly outperforms available prediction servers
We now compare the performance of ConCavity
to several existing ligand binding site identification methods with publicly available web servers. LigsiteCS 
is an updated version of geometry-based Ligsite
, and LigsiteCSC 
is a similar structural method that considers evolutionary conservation information. Q-SiteFinder 
estimates van der Waals interactions between the protein and a probe in a fashion similar to PocketFinder. CASTp 
is a geometry-based algorithm for finding pockets based on analysis of the protein's alpha shape. Each of the servers produces a list of predicted pockets represented by sets of residues; however, none of them provide a full 3D representation of a predicted pocket. As a result, we assess their ability to predict ligand binding residues. See the Methods
section for more information on the generation and processing of the servers' predictions. In brief, the residues predicted by each server are ranked according to the highest ranking pocket to which they are assigned, i.e., all residues from the first predicted pocket are given a higher score than those from the second and so on. We re-implemented the conservation component of LigsiteCSC
, because the conservation-based re-ranking option on the web server did not work for many of the proteins in our dataset. We used JSD
as the conservation scoring method.
presents the ligand binding residue PR-curves for each of these methods. ConCavity significantly outperforms LigsiteCS, LigsiteCSC+, Q-SiteFinder, and CASTp (p<2.2e−16 for each). Surprisingly, Conservation is competitive with these structure-based approaches. Several of the servers did not produce predictions for a small subset of the proteins in the database, e.g., the Q-SiteFinder server does not accept proteins with more than 10,000 atoms. is based on 234 proteins from the LigASite dataset for which were able to obtain and evaluate predictions for all methods. Thus the curve for ConCavity is slightly different than those found in the other figures, but its performance does not change significantly.
Comparison of ConCavity with publicly available ligand binding site prediction servers.
LigsiteCSC+ is the previous method most similar to ConCavity; it uses sequence conservation to rerank the pockets predicted by LigsiteCS. LigsiteCSC+ provides slight improvement over LigsiteCS, but the improvement is dwarfed by that of ConCavity over Structure (). This illustrates the benefit of incorporating conservation information directly into the search for pockets in contrast to using conservation information to post-process predicted pockets.
The poor performance of these previous methods at identifying ligand binding residues is due in part to the fact that they do not distinguish among the residues near a predicted binding pocket. The entire pocket is a useful starting place for analysis, but many residues in a binding pocket will not actually contact the ligand. Knowledge of the specific ligand binding residues is of most interest to researchers. The predictions of our methods reflect this---residues within the same pocket can receive different ligand binding scores. The inability of previous methods to differentiate residues in a pocket from one another is one reason why we elect to use our own implementations of previous structure-based methods as representatives of these approaches in all other comparisons. See the Methods
section for more details.
We tested an additional approach for combining sequence conservation with structural information that was suggested by the observation that clusters of conserved residues in 3D often overlap with binding sites 
. Briefly, the method performs a 3D Gaussian blur of the conservation scores of each residue, and assigns each residue the maximum overlapping value. Thus residues nearby in space to other conserved residues get high scores. This approach improved on considering conservation alone, but was not competitive with ConCavity
). We also considered the clusters of conserved residues generated by the Evolutionary Trace (ET) Viewer 
. The clusters defined at 25% protein coverage were ranked by size, and residues within the clusters were ranked by their raw ET
score. This approach did not perform as well as the above clustering algorithm (data not shown), and was limited to single chain proteins, because ET
returns predictions for only one chain of multi-chain proteins.
ConCavity performs similarly for geometry and energetics based grid creation methods
In the previous sections, we used ConCavityL, which integrates evolutionary sequence conservation estimates from the Jensen-Shannon divergence (JSD) into Ligsite+, to represent the performance of the ConCavity approach. However, our strategy for combining sequence conservation with structural predictions is general; it can be used with a variety of grid-based surface pocket identification algorithms and conservation estimation methods.
gives PR-curves that demonstrate that ConCavity
provides excellent performance whether the structural approaches are based on geometric properties (Ligsite+
) or energetics (PocketFinder+
). The significant improvement holds for predicting both ligand locations in space (p<2.2e−16 for each pair) () and ligand binding residues (p
6.802e−13 for Ligsite+
, p<2.2e−16 for PocketFinder+
, p<2.2e−16 for Surfnet+
) (). The three ConCavity
versions perform similarly despite the variation in performance between Ligiste+
, and Pocketfinder+
. In the following sections we will include performance statistics for all three methods when space and clarity allow. When not presented here, results for all methods are available in the supplementary file Text S1
Comparison of different versions of ConCavity.
We have also found that ConCavity
achieves similar performance when a different state-of-the-art method 
is used to score evolutionary sequence conservation (Text S1
Structure-based methods have difficulty with multi-chain proteins
Proteins consisting of multiple subunits generally have more pockets than single-chain proteins due to the gaps that often form between chains. To investigate the effect of structural complexity on performance, we partitioned the dataset according to the number of chains present in the structure predicted by the Protein Quaternary Structure (PQS) server 
and performed our previous evaluations on the partitioned sets. gives these statistics for ConCavity
, and Conservation
. To enable side-by-side comparison, we report the area under the PR curves (PR-AUC) rather than giving the full curves.
Ligand-binding site identification performance by number of chains in structure.
As the number of chains in the structure increases, there is a substantial decrease in the performance of Structure. The pattern is seen both when predicting ligand binding residues () and pockets (). This effect is so large that, for proteins with five or more chains, Conservation outperforms Structure. The number of chains in the protein has little effect on Conservation's performance. The performance of Random on proteins with a small number of chains is slightly worse than on proteins with many chains (e.g., Residue PR-AUC for 1 chain: 0.097, 2 chains: 0.110, 3 chains: 0.127, 4 chains: 0.119, 5+ chains: 0.142), so the drop in Structure's performance is not the result of the proportion of positives in each set. These observations emphasize the importance of including multi-chain proteins in the evaluation.
The homo-tetramer 1G6C in provides an illustrative example of the failure of Structure on multi-chain proteins. There is a large gap between the chains in the center of the structure, and several additional pockets are formed at the interface of pairs of contacting chains. As seen in the figure, the large central cavity does not bind a ligand; however, it is the largest pocket predicted by Structure. This is frequently observed among the predictions. While some pockets between protein chains are involved in ligand binding, many of them are not. As the number of chains increases, so does the number of such potentially misleading pockets.
By incorporating sequence conservation information, ConCavity accurately identifies ligand binding pockets in multi-chain proteins. The conservation profile on the surface of 1G6C provides a clear example of this; the pockets that exhibit sequence conservation are those that bind ligands (). 1G6C is not an exception. ConCavity provides significant performance improvement for each partition of the dataset in all three evaluations, and greatly reduces the effect of the large number of non-ligand-binding pockets in multi-chain proteins on performance. ConCavity also provides improvement over Structure on the set of one chain proteins. This is notable because these proteins do not have between-chain gaps, so the improvement comes from tracing ligands and selecting among intra-chain pockets more accurately than using structural information alone (as in ).
ConCavity performs well on both apo and holo structures
The binding of a ligand induces conformational changes to a protein 
. As a result, the 3D structure of the binding site can differ between structures of the same protein with a ligand bound (holo) and not bound (apo). In the holo structures, the relevant side-chains are in conformations that contact the ligand, and this often defines the binding pocket more clearly than in apo structures. To investigate the effect of the additional information provided in holo structures on performance, we evaluated the methods on both sets ().
Area under the Precision-Recall curve (PR-AUC) for ligand-binding residue prediction methods on apo (unbound) and holo (bound) versions of LigASite.
As expected, all methods performed better on the holo (bound) structures than the corresponding apo (unbound) structures. However, all previous conclusions hold whether considering apo structures or holo structures; the ranking of the methods is consistent, and the improvement provided by considering conservation is similarly large. PR curves for this comparison are given in the supplementary file Text S1
. We will continue to report residue prediction results computed using the apo structures when possible in order to accurately assess the performance of the algorithms in the situation faced by ligand binding site prediction methods in the real world.
The methods better identify ligand binding sites in enzymes than non-enzymes
The LigASite apo dataset contains protein molecules that carry out a range of different functions. Enzymes are by far the most common; they make up 254 of the 332 proteins in the dataset. The remaining 78 non-enzyme ligand binding proteins are involved in a wide variety of functions, e.g., transport, signaling, nucleic acid binding, and immune system response.
compares the performance of the ligand binding site prediction methods on enzymes and non-enzymes. There is more variation within each method's performance on non-enzyme proteins, and all methods perform significantly better on the enzymes (e.g., p
3.336e−4 for ConCavityL
). Active sites in enzymes are usually found in large clefts on the protein surface and consistently exhibit evolutionary sequence conservation 
, so even though enzymes bind a wide array of substrates, these common features may simplify prediction when compared to the variety of binding mechanisms found in other proteins.
Ligand binding residue identification in enzymes and non-enzymes (LigASite apo).
Despite the drop in performance on non-enzyme proteins, the main conclusions from the earlier sections still hold. However, the improvement provided by ConCavity is not as great on the non-enzymes. This could be the result of the more complex patterns of conservation found in non-enzyme proteins, and the comparatively poor performance of Conservation in this setting. It is also possible that Ligsite+'s approach is particularly well suited to identifying binding sites in non-enzymes. Overall, these results highlight the importance of using a diverse dataset to evaluate functional site predictions.
ConCavity improves identification of drug binding sites
Knowledge of small molecule binding sites is of considerable use in drug discovery and design. Many of the techniques used to screen potential targets, e.g., docking and virtual screening, are computationally intensive and feasible only when focused on a specific region of the protein surface. Structure based surface cavity identification algorithms can guide analysis in such situations 
To test ConCavity
's ability to identify drug binding sites, we evaluated it on a set of 98 protein-drug complexes 
. The superior performance provided by ConCavity
on the diverse set of proteins considered above suggests that ConCavity
would likely be useful in the drug screening pipeline. compares the ligand overlap PR-AUC and Jaccard coefficient for the three versions of ConCavity
and their structure-based analogs. Each ConCavity
method significantly improves on the methods that only consider structural features (e.g., p
1.25e−6 on overlap PR-AUC and p
2.06e−6 on Jaccard for ConCavityL
). While the improvement is not quite as large on this dataset as that seen on the more diverse LigASite dataset, it is still significant. It is possible that this is due to the fact that drug compounds are not the proteins' natural ligands; the evolutionary conservation of the residues in binding pockets may reflect the pressures related to binding the actual ligands rather than the drugs.
Drug binding site identification.
Examples of difficult structures
While ConCavity signficantly outperforms previous approaches, its performance is not flawless. In , we give three example structures that illustrate patterns observed when ConCavity performs poorly. Handling these cases is likely to be important for further improvements in ligand binding site prediction.
Examples of difficult structures.
The first pattern common among these difficult cases is evolutionary sequence conservation information leading predictions away from actual ligand binding sites. provides an example in which the ligand binding site is less conserved than other parts of the protein. The ActR protein from Streptomyces coelicolor
(PDB: 3B6A) contains both a small molecule ligand-binding and a DNA-binding domain 
. The ligand-binding domain is in the bottom, less-conserved half of the structure. The DNA-binding domain is found in the more conserved top half of the given structure. The greater conservation of this domain causes ConCavity
to focus on the DNA-binding site over the ligand binding site. In other cases, conservation information is uninformative due to a lack of homologous sequences. Conservation estimates based on low quality sequence alignments may harm performance for some structures, but we have found that they still provide a net performance gain overall (Text S1
also provides two examples of another difficult case: ligands bound outside of clearly defined, concave surface pockets. In , ConCavity
identifies the center of the ring-shaped structure of the pentameric B-subunit of a shiga-like toxin (PDB: 1CQF) as the binding site. This protein binds to glycolipids, like the globotriaosylceramide (Gb3) shown, via a relatively flat interface that surrounds the center of the ring 
. The center cavity (ConCavity
's prediction) is filled by a portion of the A-subunit of the toxin (not included in the structure) which after binding breaks off and enters the host cell. shows the structure of a dimeric noncatalytic carbohydrate binding module (CBM29) from Piromyces equi
complexed with mannohexaose (PDB: 1GWL). The carbohydrate ligands bind in long flat clefts on the protein surface 
. Even though these sites exhibit significant evolutionary conservation, their geometry prevents them from being predicted. Instead, a less conserved pocket formed between the chains is highlighted by ConCavity
Overall, cases such as these are rare; ConCavity's predictions fail to overlap a ligand in only 5% of structures. In addition, some of these “incorrect” predictions are actually functionally relevant binding sites for other types of interactions as illustrated in .
Integrating conservation and structure improves prediction of catalytic sites
Ligand-binding sites are not the only type of functional site of interest to biologists. A large amount of attention has been given to the problem of identifying catalytic sites. As noted above, the majority of enzyme active sites are found in large clefts on the protein surface, so even though the structural methods considered in this paper were not intended to identify catalytic sites, they could perform well at this task.
gives the results of an evaluation of the methods' ability to predict catalytic sites (defined by the Catalytic Site Atlas 
) in the LigASite apo dataset. Compared to ligand binding site prediction, the relative performance of the methods is different in this context. The ConCavity
approach still significantly outperforms the others (p<2.2e−16 for Structure
8.223e−4 for Conservation
). Most surprisingly, Conservation
significantly outperforms methods based on structure alone (p
). All the methods have lower PR-AUC when predicting catalytic sites than predicting ligand-binding residues (e.g., ConCavityL
has PR-AUC of 0.315 versus 0.608); this is due in large part to the considerably smaller number of catalytic residues than ligand-binding residues per protein sequences.
Catalytic residue identification (LigASite apo).
These results imply that being very evolutionarily conserved is more indicative of a role in catalysis than being found in a surface pocket. Though catalytic sites are usually found in pockets near bound ligands, there are many fewer catalytic sites per protein than ligand-binding residues. As a result simply searching for residues in pockets identifies many non-catalytic residues. This is consistent with earlier machine learning studies that found conservation to be a dominant predictive feature 
, and it suggests that new structural patterns should be sought to improve the identification of catalytic sites.
Several previous methods have combined sequence conservation and structural properties in machine learning frameworks to predict catalytic sites 
. Direct comparison with these methods is difficult because most datasets and algorithms are not readily available. Tong et al. 
compared the precision and recall of several machine learning methods on different datasets in an attempt to develop a qualitative understanding of their relative performance. While it is not prudent to draw conclusions based on cross-dataset comparisons, we note for completeness that ConCavity
's catalytic site predictions the diverse LigASite dataset achieve higher precision (23.8%) at full recall than the maximum precision (over all recall levels) reported for methods in their comparisons.