Protease substrate profiling has nowadays almost become a routine task for experimentalists, and the knowledge on protease peptide substrates is easily accessible via the MEROPS database. We present a shape-based virtual screening workflow using vROCS that applies the information about the specificity of the proteases to find new small-molecule inhibitors. Peptide substrate sequences for three to four substrate positions of each substrate from the MEROPS database were used to build the training set. Two-dimensional substrate sequences were converted to three-dimensional conformations through mutation of a template peptide substrate. The vROCS query was built from single amino acid queries for each substrate position considering the relative frequencies of the amino acids. The peptide-substrate-based shape-based virtual screening approach gives good performance for the four proteases thrombin, factor Xa, factor VIIa, and caspase-3 with the DUD-E data set. The results show that the method works for protease targets with different specificity profiles as well as for targets with different active-site mechanisms. As no structure of the target and no information on small-molecule inhibitors are required to use our approach, the method has significant advantages in comparison with conventional structure- and ligand-based methods.
Proteases are important targets in drug design, as they are part of numerous fundamental cellular processes.1 There are seven distinct classes of proteases, which are classified according to the catalytic residue: serine, threonine, cysteine, aspartate, and glutamate proteases, metalloproteases, and asparagine peptide lyases.2 Within each protease class, the reaction mechanism is highly conserved. In addition, proteases often have many closely related family members, and lead compounds often hit more than one target. Therefore, achieving target specificity when designing protease inhibitors remains a difficult challenge.3
Current virtual screening strategies to find new small-molecule inhibitors can be divided into two groups: ligand-based approaches and structure-based approaches. To apply a ligand-based approach, information on one or more ligands that can bind to the target is required. From the set of known actives, structurally diverse compounds with similar bioactivity should be discovered.4
Structure-based methods require either an X-ray or NMR structure or a homology model of the target. Of the structure-based methods, docking and scoring is the most used method in virtual screening. However, finding the correct binding conformation through a docking experiment remains a challenging task.5 Consideration of the flexibility of the protein and ligand is not easy to achieve, even with flexible docking methods.6 Another structure-based method is pharmacophore-based virtual screening.7 The “stripping” of functional groups has the advantage that scaffold hopping is possible if topological pharmacophores are used.8
Shape-based virtual screening with ROCS9 is an alternative to docking and pharmacophore-based virtual screening.10 Virtual screening results with ROCS show higher consistency than the results of docking strategies. Inclusion of the pharmacophore properties of the query molecule allows a combination of the chemical information and the information about the shape when screening for small-molecule inhibitors. Screening of the DUD database11 using a combination of shape and pharmacophore properties revealed a superior performance of ROCS relative to docking approaches.12
With methods like proteomic identification of protease cleavage site specificity (PICS)13 and terminal isotopic labeling of substrates (TAILS)14 and the use of proteome-derived substrate libraries,13 protease specificity profiles can be readily determined. In PICS, the carboxypeptide cleavage products of an oligopeptide library, consisting of natural biological sequences derived from human proteomes, are selectively isolated, and liquid chromatography–tandem mass spectrometry (LC–MS/MS) is used to identify the prime side sequences of the cleaved peptides. Nonprime side sequences are determined through automated database searches of the human proteome. PICS thus enables simultaneous determination of prime and nonprime side sequences of cleaved peptides.13 N-TAILS allows one to distinguish between N-termini of proteins and N-termini of protease cleavage products. Dendritic polyglycerol aldehyde polymers are used to remove tryptic and C-terminal peptides. Tandem mass spectrometry is used to analyze unbound naturally acetylated, cyclized, or labeled N-termini from proteins and their protease cleavage products.15 C-TAILS complements N-TAILS and represents an isotope-encoded quantitative C-terminomics strategy to identify neo-C-terminal sequences and protease substrates.14 With the availability of those efficient approaches for protease substrate profiling, the amount of information on protease peptide substrates is growing every day. With the cleavage entropy, a metric developed in our group, quantification of protease specificity and ranking of proteases according to specificity is possible.16 The MEROPS database represents the biggest collection of known protease peptide substrates, and it is constantly being improved and updated.2 We have developed a virtual screening workflow based solely on the information on protease peptide substrate sequences present in the MEROPS database that can be used to find new small-molecule inhibitors. 
The types of possible interactions of the substrate peptides are the same as for small molecules. Therefore, it should be possible to find small molecules that form the same interactions with a protease as the corresponding peptide substrates. The idea of using an analysis of the protease peptide substrate space to find small-molecule inhibitors is not new per se. Recently it was shown in our group that proteases that are close in substrate space are often targeted by the same small molecules.17 Sukuru et al.18 developed a lead discovery strategy based on the similarity of proteases in the protease substrate space. They recovered the known inhibitors of proteases that are highly correlated. Their approach allows one to use a ligand-based approach to find inhibitors for proteases for which no ligands are known. However, their method can be applied only if information on small-molecule ligands is available for a protease that is similar in substrate space.
In developing a virtual screening workflow that transfers information on peptide substrate specificity to small-molecule specificity, we are faced with a complex three-dimensional problem. The relative positions of the features of the amino acid side chains in the peptide substrates and the overall shape of the bound peptide substrates are of high importance. In addition, the relative frequencies of amino acids in the peptide substrate sequences have to be considered. As a shape-based virtual screening method is most suited to address the problem and ROCS also offers the possibility to selectively weight pharmacophore features, shape-based virtual screening with ROCS is the method of choice for our virtual screening problem.
We tested our method on four targets, thrombin, factor Xa (fXa), factor VIIa (fVIIa), and caspase-3 (casp-3), which were selected according to substrate specificity profiles. In addition to showing different substrate specificities, the proteases also have different catalytic mechanisms. Thrombin, fXa, and fVIIa are serine proteases, while casp-3 is a cysteine protease. Cleavage-site sequence logos for all four targets are shown in Figure 1. Protease subpockets are termed S4–S4′ on the basis of the corresponding substrate positions P4–P4′ according to the convention of Schechter and Berger.19 The peptide’s scissile bond lies between P1 and P1′.
Substrate sequences were downloaded from the MEROPS database.2 Substrate positions P3–P1 were considered in a first step, as most known inhibitors for the investigated proteases bind to the corresponding protease subpockets. For casp-3, tetrapeptides ranging from P4–P1 were also explored. Unique tri- or tetrapeptide sequences were downloaded from MEROPS.
As MEROPS provides only substrate sequences but no information on substrate conformations, a way to convert the two-dimensional sequences into three-dimensional structures is needed. It is known that proteases universally recognize β-strands in their binding sites.20 To obtain peptides in β-strand conformations, we decided to use a mutation strategy based on a known X-ray structure of a protease–substrate complex downloaded from the Protein Data Bank (PDB).21 For fVIIa and fXa, no suitable complex structures could be found, so the same template was used for the three serine proteases fVIIa, fXa, and thrombin (PDB code 1FPH(22)).
For casp-3, a different template was selected, as a template protease–substrate structure was available (PDB code 2DKO(25)). The Molecular Operating Environment (MOE) software26 was used for preparation of substrate conformations. Only the template peptide substrate positions P3–P1 or P4–P1 were kept. Mutations of the selected substrate positions were carried out using the residue scan functionality within the MOE software, which allows one to perform single-point or multiple mutations within a peptide sequence.
Mutating each peptide position independently to all of the 20 amino acids leads to a mutational space of 60 for a tripeptide. Using the peptide substrate sequence lists, individual amino acids present in the peptide substrates listed in MEROPS were extracted from the 60 mutated sequences generated with the MOE residue scan for each position P3–P1 or P4–P1. The single amino acids for each substrate position were written to individual pdb files.
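The selection step described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the MOE residue scan itself: the function names and the three example tripeptides are hypothetical, and the sketch assumes that sequences are written from the highest nonprime position down to P1.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def single_point_variants(length):
    """Enumerate every (position label, residue) pair for independent
    single-site mutation of a peptide ending at P1 (20 per position)."""
    return [(f"P{length - pos}", aa)
            for pos in range(length)
            for aa in AMINO_ACIDS]


def select_variants(substrates):
    """Keep only the single-residue variants that actually occur at the
    same position in at least one known substrate sequence."""
    length = len(substrates[0])
    if any(len(seq) != length for seq in substrates):
        raise ValueError("all substrate sequences must have the same length")
    observed = {f"P{length - pos}": {seq[pos] for seq in substrates}
                for pos in range(length)}
    return [(label, aa)
            for label, aa in single_point_variants(length)
            if aa in observed[label]]


if __name__ == "__main__":
    # Hypothetical tripeptides (P3-P2-P1), used only to exercise the code.
    example = ["LPR", "VPR", "GGR"]
    print(len(single_point_variants(3)))  # size of the mutational space
    print(select_variants(example))
```

For a tripeptide the enumeration yields 3 × 20 = 60 variants, matching the mutational space quoted above; the selection then reduces this set to the residues that the substrate list supports at each position.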
We used the DUD-E database27 for all four of our test cases. Database preparation was carried out with MOE. Duplicate entries were removed, and both the actives and decoys of all data sets were subjected to the MOE wash procedure to disconnect simple metal salts drawn in covalent notation, remove counterions and solvent molecules, add or remove explicit hydrogen atoms, and rebalance protonation states.
For shape-based virtual screening, 25 conformations for each active and decoy were created with OMEGA.28,29 The actives database for casp-3 required special attention because several entries contained not the bioactive but the prodrug form of the molecule. Prior to conformer generation, we manually hydrolyzed the lactones in the prodrug structures in MOE.
Potentially covalently bound molecules were kept, as the interactions directing the ligand into the subpockets should still be found.
To create the query for shape-based virtual screening, first each individual amino acid was loaded into vROCS, and the backbone features were disabled. For alanine, a hydrophobic feature was added because vROCS did not do this automatically as it did with the functionalities of the other amino acids. Each amino acid was then saved as a separate single amino acid query.
To create a query that correctly represents the relative frequencies of amino acid side chains in the preferred substrates of the corresponding protease, the relative frequencies were first calculated in the following way: Absolute frequencies were normalized according to the number of unique peptide substrate sequences and the natural occurrence of the amino acids. The normalization by the natural occurrence of the amino acids was needed to remove the bias in the experimental results30 of the MEROPS peptide substrate sequences. As vROCS does not allow the number of times a feature should appear in the final query to be set, each individual amino acid query has to be loaded a number of times that reflects its relative frequency in the protease peptide substrates. Since vROCS cannot handle a large number of different amino acid queries that are each loaded many times, we further normalized the frequencies in such a way that the most frequently occurring amino acid in the substrate has a frequency of 20. Tables with relative amino acid frequencies for each protease example can be found in Tables S1–S5 in the Supporting Information. To build the final query, each single amino acid query was loaded into vROCS according to the obtained frequency table. The query was then used in a ROCS validation run using the prepared actives and decoys data set. Of the 25 conformations for each active and decoy, only the highest ranked conformation was kept. Enrichment factors at X% (EFX%) were calculated according to the following metric:31
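$$\mathrm{EF}_{X\%} = \frac{\mathrm{Actives}_{\mathrm{sampled}} / N_{\mathrm{sampled}}}{\mathrm{Actives}_{\mathrm{total}} / N_{\mathrm{total}}}$$
where $\mathrm{Actives}_{\mathrm{sampled}}$ is the number of actives found at X% of the screened database, $N_{\mathrm{sampled}}$ is the number of compounds at X% of the database, $N_{\mathrm{total}}$ is the number of compounds in the database, and $\mathrm{Actives}_{\mathrm{total}}$ is the number of actives in the database.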
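Two of the calculations in this section, the frequency scaling used to build the query and the enrichment factor, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the background frequencies must be supplied by the user, rounding to whole copies is assumed, and the text does not say whether the scaling to 20 is applied per position or across the whole query (the sketch scales whatever dictionary it is given). The enrichment factor combines the four quantities defined above in the usual ratio of the hit rate in the sampled fraction to the hit rate in the whole database.

```python
def scaled_copies(counts, n_sequences, background, top=20):
    """Turn absolute residue counts into numbers of query copies.

    counts      -- residue -> absolute count at one substrate position
    n_sequences -- number of unique substrate sequences
    background  -- residue -> natural occurrence (user-supplied values)
    top         -- copies assigned to the most frequent residue
    """
    # Normalize by the number of sequences and by natural occurrence.
    normalized = {aa: (c / n_sequences) / background[aa]
                  for aa, c in counts.items()}
    peak = max(normalized.values())
    # Rescale so that the most frequent residue gets `top` copies
    # (rounding to integers is an assumption of this sketch).
    return {aa: round(top * value / peak) for aa, value in normalized.items()}


def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor for the top `fraction` of a ranked list.

    ranked_is_active -- booleans ordered from best to worst score
    fraction         -- e.g. 0.01 for EF1%
    """
    n_total = len(ranked_is_active)
    actives_total = sum(ranked_is_active)
    if n_total == 0 or actives_total == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_sampled = max(1, round(fraction * n_total))
    actives_sampled = sum(ranked_is_active[:n_sampled])
    # Same ratio as (actives_sampled / n_sampled) / (actives_total / n_total),
    # rearranged to keep the arithmetic exact for integer inputs.
    return (actives_sampled * n_total) / (n_sampled * actives_total)
```

As an example with made-up numbers, a database of 100 compounds containing 4 actives, 2 of which appear in the top 10%, gives an enrichment factor of (2/10)/(4/100) = 5 at 10%.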
The results of the shape-based virtual screening are summarized in Table 1. In addition to enrichment factors at 1 and 2% of the database screened, enrichment factors at 5% are also shown, as they might be more relevant for industry-scale applications. Figure 3 shows the receiver operating characteristic (ROC) curves for the results listed in Table 1. The results of performing the virtual screening using the query of one protease with the data set of the other protease are shown in Table 2.
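For readers who want to reproduce this kind of evaluation from a table of per-conformer scores, the following sketch shows how the best conformer per molecule can be kept and how the area under the ROC curve can be computed. It is an illustration only: the function names and the input layout are hypothetical, higher scores are assumed to be better, and the pairwise AUC formula is chosen for clarity rather than speed.

```python
def best_score_per_molecule(conformer_scores):
    """Keep only the highest score among all conformers of each molecule.

    conformer_scores -- iterable of (molecule_id, score) pairs
    """
    best = {}
    for mol_id, score in conformer_scores:
        if mol_id not in best or score > best[mol_id]:
            best[mol_id] = score
    return best


def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    active scores higher than a randomly chosen decoy (ties count half)."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from decoys. For large data sets a rank-based implementation would be preferable to the quadratic loop used here.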
The results for thrombin are lowest in terms of area under the curve (AUC) when screening the DUD-E database, but at the same time, the early enrichment is highest.
The highest-ranked decoys for thrombin all show the guanidine functionality at the P1 position, which is also fundamental for substrate recognition in the thrombin peptide substrates.32
With regard to shape as well as chemical functionalities, the highest-ranked decoys look like classical thrombin inhibitors (Figure 4).33 The lowest-ranked actives are smaller than the ROCS query, which leads to a penalty in volume overlap and thus to a lower ranking. In addition, most of them do not have the characteristic thrombin-interacting groups and in general lack functional groups that allow for strong selective interactions with the binding site.
In the same way as for thrombin, the highest-ranked decoys for fXa all contain the guanidine group and are shaped like classical fXa inhibitors. The number of peptides used for creation of the ROCS query for fXa is much lower than for thrombin, as in comparison there is little data in the MEROPS database about fXa substrates. Despite the limited number of available substrates, the AUC values are quite high when screening the DUD-E database.
For fVIIa as well, the highest-ranked actives and decoys all possess the guanidine functionality at the S1 binding position, whereas the lowest-ranked actives lack it. In some cases they even possess negatively charged groups, in contrast to the substrate specificity at the S1 position (Figure 5). For fVIIa there are only nine substrates listed in the MEROPS database, which is even fewer than for fXa. Therefore, the vROCS query for fVIIa might miss some important information because of incomplete substrate data. However, in view of the low number of known substrates, it is impressive how well the method performs in terms of AUC and early enrichment.
In the case of casp-3, the carboxylate group at the S1 position seems to be required for the compound to be a high-ranked active or decoy (Figure 6). Interestingly, several of the highest-ranked actives are prodrugs.34 If the lactone functionality in the prodrugs is not opened and converted to the bioactive form, they are ranked lowest in the virtual screen. If used in their bioactive form, however, small molecules that are administered in prodrug form are among the highest-ranked actives. As casp-3 shows typical DEVD specificity25 and thus also high specificity at S4, for casp-3 we used a model based on positions P3–P1 as well as a second model based on P4–P1. Using a broader substrate position range did not considerably improve the AUC and early enrichment. However, different actives and decoys were ranked highest, depending on how many substrate positions were used, whereas the lowest-ranked actives were similar for both substrate position ranges.
As the results of the shape-based virtual screening runs may depend strongly on the query conformation, we investigated the importance of the template peptide. For fXa and fVIIa, for which there are no protease–substrate complexes available in the PDB, we compared the results of using either a thrombin protease–substrate complex or a casp-3 protease–substrate complex as the template for the mutation strategy. The results in Table 3 show that for the fXa DUD-E validation runs, the AUC and early enrichment become slightly worse when a casp-3 protease–substrate complex is used as the template instead of a thrombin protease–substrate complex. For fVIIa the AUC is not affected by using a different protease–substrate complex as the template for the mutation strategy. Only the early enrichment values decrease slightly when the casp-3 protease–substrate complex is used as a template instead of the thrombin protease–substrate complex. The results show that the mutation strategy works even when the peptide substrate sequences and the template peptide show low sequence identity. As long as a template peptide in an extended β-strand conformation is available, our method can be applied.
The main advantage of our method is that it does not require a structure or knowledge of any small-molecule ligands for a virtual screening to be performed when dealing with a protease target. Only information on protease substrate sequences is required. If there are no substrates for the desired protease target listed in the MEROPS database, substrate specificity profiling can be done rather quickly in comparison with generating a structure or finding small-molecule inhibitors.
The advantage compared with the method of Sukuru et al.18 is that we directly transfer the information about the known peptide substrates for a protease to the small-molecule space. Thus, to find new inhibitors no prior knowledge of small-molecule ligands is required.
As ROCS uses a very fast and efficient algorithm for the virtual screening runs, hundreds of thousands of molecules can be screened within hours. In combination with the easy accessibility of the data required for building the query, our method has significant advantages over docking and other structure-based methods as well as ligand-based approaches using small-molecule ligands as the basis for virtual screening experiments.
We have presented a method that enables the fast and efficient derivation of a model from protease peptide substrate data that can be readily applied to screen for small-molecule ligands. We have applied it to four different proteases that cover different active-site mechanisms, substrate specificities, and binding-site shapes. In all four cases, the method performed well in terms of AUC and early enrichment. Even in the case of fVIIa and fXa, where available substrate data are limited, the method successfully recovered actives from the very challenging data sets prepared from the DUD-E database. The workflow described herein represents the first approach to use protease substrate sequences as the training set for a virtual screening experiment. As the query creation in vROCS allows one to include information on the relative frequencies of amino acids of substrates in the respective subpockets and focus on the properties of side chains in substrates, scaffold hopping is made possible. The method can easily be applied to different protease systems. Thus, we believe it can also be applied to members of other enzyme types, such as kinases. In summary, we have developed a new tool to be used for rational drug design, allowing the huge amount of data on protease substrates to be used for finding new small-molecule inhibitors.
B.J.W. is thankful to the Austrian Academy of Sciences for being a recipient of the DOC Grant. The authors also thank the Austrian Science Fund (FWF) for funding of Project P 23051.
† C.K.: F. Hoffmann-La Roche AG, Grenzacherstrasse 124, 4070 Basel, Switzerland.
The authors declare no competing financial interest.