All docking scores of ligand-biding site pairs were merged into a matrix form. In total, yeast profiles and human profiles are composed of 1,165 and 10,886 binding sites, respectively with thirty-five ligands (Table ). The numeric data are available in Additional files 1
. The resulting docking profiles were hierarchically clustered in both directions for the further analysis. The clustering results are shown in Figure .
The list of ligands used to generate reverse docking profiles
Figure 1 Color mapped docking fitness profile matrix in (a) yeast and (b) human. The elements represent corresponding docking fitness. The numbers in horizontal axis corresponds to ligands in Table 1. Each row presents a protein binding site. There are (a) 1,165 (more ...)
The dendrograms along the column (chemical space) in both species are very similar to each other. We compared these dendrograms with two dendrograms from PubChem Structure Clustering [12
] based on a measure recently developed for comparing two hierarchical clustering [13
]. PubChem provides two kinds of clustering based on 2D structure fingerprint and 3D shape/feature similarity. Our two dendrograms are more similar with the clustering based on 3D similarity than that based on 2D similarity (Table ). The reason may be that 3D conformations of ligands are more relevant for the protein-ligand docking fitness than 2D information. We also suggest that the topology of ligand clusters can be used as a new similarity measure between two small molecules. Nevertheless, relatively high similarity scores indicate that our docking profile data convey accurate information on their chemical properties.
The similarities among hierarchical clusterings in ligand space.
The "druggability" of a certain target protein represents how probable the protein is in fact a real target of drugs, and it has been investigated in many previous studies [14
]. In one such method, the druggability of a protein was inferred from its homologous proteins whose druggabilities were already known [17
]. The weakness of this method is that the number of targets with known druggability is limited. Other approaches attempted to define "druggable" as "highly likely to bind to putative drugs", i.e., "bindability" [18
In the context of bindability, the docking profiles in this study can provide good large scale simulated data. We first checked whether our data were in accord with predefined druggability dataset [20
]. The n
edundant set of d
ruggable and l
ruggable (NRDLD) set contains 71 druggable and 44 less-druggable targets. Since not all the entries are human proteins, the numbers of overlapped targets are 43 druggable and 8 less-druggable. Figure shows the average docking fitness values of the sets of those overlapped targets, along with the averages of 10,886 binding sites' docking scores in human data set. Except for the case of cimetidine (Ligand 6), all averages of our docking scores of druggable set are greater than those of less-druggable set. One can observe that overall average values are placed between druggable and less-druggable sets in nearly all cases. This result suggests that without serious training it may be possible to classify all protein targets whose docking profiles are available into druggable or less-druggable target.
Figure 2 Average docking scores of NRDLD druggable and less druggable set and those of overall docking fitness. The numbers in horizontal axis represents the corresponding ligands in Table 1. The subplot represents difference between NRDLD druggable and less druggable (more ...)
Next, by simply comparing docking fitness scores of each protein target against 35 ligands with the average fitness scores of those ligands (see Methods), we predicted 539 putative druggable binding sites and 289 less-druggable binding sites (Additional file 3
). The predicted druggable and less-druggable binding sites were classified into 6 enzyme classes according to the first digit of their EC numbers [21
] (Figure ). Oxidoreductases occur more frequently in the druggable. Hydrolase, lyases and isomerases occur more frequently in the less-druggable set. Except for the case of ligase in which all relative frequencies are less than 0.1, the enzyme class distribution trends are similar in both the predicted set and NRDLD set.
Figure 3 Distributions of the enzyme classes. Distributions of the enzyme classes in the NR Putative druggable and less druggable set (assigned in this study; red bars), and NRDLD druggable and less druggable set (green bars). Note that the enzyme class distribution (more ...)
Protein function prediction based on docking profile similarity
An advantage of the present study is that docking profiles are available for both human and yeast proteomes. Yeast is the best characterized eukaryotic model organism. A variety of related resources such as chemical genomic profile [22
], whole genome knock-out library [23
], protein-protein interaction and genetic interaction data [24
] are available. We expect that docking scores generated in this study can be combined with these resources to infer novel protein targets. In addition, it is valuable to check whether orthologous protein pairs of human and yeast share similar docking profiles. In Figure , we plot two distributions of Euclidean distances of docking scores of both species; one for the orthologous pairs and the other for all human-yeast pairs. It is observed that sequence similarity is generally reflected in the similarity of docking profile across the two species.
Frequency distributions of distances. The dashed line shows distribution of Euclidean distances of docking fitness between orthologs; the solid line shows that of distances between all yeast-human pairs.
The results shown in Figure suggest that we can utilize our docking method to infer the function of proteins, especially the proteins that have no apparent orthologs with known function. To show that docking profiles contain the information which can be utilized to predict the function of proteins, we carried out a large-scale function prediction of enzymes. For 3,874,883 pairs of 5,989 human proteins and 647 yeast proteins, we collected all pairs for which EC numbers were available. We used BLAST e-value [25
] as the sequence similarity measure, and Euclidean distance between the two docking profiles as the docking profile similarity measure. The performances are shown in Figure as r
haracteristic (ROC) curve. In low false positive rate (FDR) region, using e-value yielded better performance, while docking profile similarity scores performed better in high FDR region. This limitation is due to substantial overlap between distance distributions of positive and negative pairs as shown in Figure . Although the average and median of distances of positive pairs are quite less than those of negative pairs, any single non-parametric Euclidean distance value cannot divide two groups perfectly. Relationship between docking fitness distance and enzyme function is more complex than a single threshold. However, it is observed that docking profile similarity contain positive information for function prediction which is not overlapped with the sequence information. Thus, this information would be used as a useful feature with combination of other features such as sequence similarity, structural similarity, and binding site similarity.
Figure 5 ROC curve for assigning EC numbers (up to fourth digit) of proteins using those of nearest proteins in the other species. Distances are based on BLAST e-value (sequence similarity), Euclidean distance of docking profiles (docking profile similarity), (more ...)
Here, for example, simple implementation of combination of sequence and docking profile information was tested. To cover low sensitivity of docking fitness in low FDR, a new distance was defined as follows: if BLAST e-value of a pair is less than 1e-5, e-value is used as the distance; if otherwise, Euclidean distance is used. The performance of this metric is shown in Figure (red). Note that this simple metric is never based on any serious training, feature extraction, or machine learning technique. Not considering which elements in 35-dimentional docking profile are important, and simply adding information of docking profile exhibits better performances in all area. In summary, this implies that using docking profile information together with other useful measures as features of state-of-the-art machine learning technique and increasing the size of docking profile, i.e., appending the reverse docking results of additional ligands would get close to more precise function prediction of proteins.
The docking profile data generated in this study can be applied in a variety of ways. As discussed in the previous section, it can be utilized to infer protein function. On the other hand, more common application that has been explored in several previous studies is to infer new binding targets for known drugs. Here, we present two case studies.
Binding target of 5-FC and 5-FU
5-fluorocytosine (5-FC) and 5-flurouracil (5-FU) are both fluorinated analogues of pyrimidine [26
]. The structures of the two ligands are quite similar. Therefore, not surprisingly, the docking profiles are quite similar as well. Moreover, the top-ranking binding site of both ligands is the structure of yeast exosome component, the protein product of gene rrp6 (PDB id: 2hbm
]. The structure was identified relatively recently, so 2hbm has never been annotated as putative target, not to mention druggable. Previously known mechanism of action of 5-FU is inhibition of thymidylate synthetase [28
]. Thus, the top-ranking structure, 2hbm, might be considered as a false positive. Fortunately, however, genome-wide study using tagged heterozygotes yeast mutants provided a strong evidence that rrp6 related rRNA processing exosome is a target of 5-FU [29
]. The direct binding target of 5-FU was not identified in the previous study, but the result of that research and the docking scores strongly suggest that the protein product of rrp6 is the direct binding target of 5-FU in yeast.
Protein structures from the same sequence
Similarly to the case of 5-FU, we also investigated the high-ranking targets in docking profile of cycloheximide (CHX). The top-ranking structure is the PDB structure 1q17 which is the protein structure of yeast gene Hst2, homologous to eukaryotic SIR2 [30
]. Interestingly, among three protein products (1q17, 1q14, 1q1a) of hst2 whose structures were identified by the same researchers [31
], 1q17 and 1q1a exhibited high binding fitness (1st and 6th ranked) while 1q14 showed poor binding affinity. We tried to find what caused these differences. It is known that 1q17 and 1q1a lack the 64 residue C-terminal tails of hst2 sequence in common while 1q14 is the structure of intact Hst2 (Figure ). We also found that there was the comparison study between yeast mutant strains, which lack corresponding C-terminal tail regions and wild type treating CHX [33
] (Figure ). In that study, expression of HST2-298Δ which corresponds to PDB 1q17
led to increased sensitivity to CHX. This phenotype is surprisingly well characterized in docking profile in our study as the top-ranked docking fitness value.
Figure 6 Sequence variation among derivatives of yeast gene Hst2. (a) C-termial sequences of intact Hst2 (uniprot ID: P53686) and those of its three structures. 1q17 and 1q1a are the structure of the 64 residue C-terminal deletion construct. (b) HST2-298Δ (more ...)
Are such a large number of protein structures necessary for reverse docking?
Compared to the protein sets used in previous studies, the set used in this study is quite large and has some redundancy. One may question whether all these structures contribute to the sensitivity of reverse docking. It is an important issue because docking still costs high computing power and is time-consuming.
In our dataset of human, 8,717 structures out of 10,886 structures have the hits sharing the same UniProt ID with 1,339 unique UniProt IDs. In other words, those 8,717 structures could be reduced into 1,339 structures by removing at most 7,378 structures if we filter the set with respect to only sequence redundancy. However, there are many cases where docking fitness profiles for similar sequences are quite different.
To show this property, we first carried out hierarchical clustering of docking profile of proteins. For each sub-cluster, if all the members were derivatives of the same UniProt ID, the members were merged into one. This procedure was repeated until there were no sub-clusters in which members shared the same UniProt ID. As a result, 1,710 structures were filtered out eventually, i.e., only about 20% of sequence-redundant protein structures exhibited the redundancy in docking profile. This is due to heterogeneity in PDB. There are many modified structures such as oxidized, reduced, multimeric, metal containing, and truncated forms for even a one protein sequence. Thus, we concluded that the sets of protein structures which were used in previous reverse docking studies are insufficient. For example, the interesting results from the docking of cycloheximide, which was discussed in the previous section, would have not been obtained.
Another interesting example is the main binding target of hydrocortisone, the g
eceptor (GCR). There are nine structures of the GCR in PDB. However, datasets used for reverse docking such as p
atabase (PDTD) [6
] included only two of them (PDB 1nhz
). The result of reverse docking of hydrocortisone by others [7
] using PDTD could not detect the GCR as the target. In our docking profile, PDB 3bqd
was the top-ranking protein target, which is another structure of the GCR. If we had removed redundancy based on sequence similarity, we could have not detected the real target of the GCR. Therefore, our reverse docking experiment suggests that using as many as possible protein structures in reverse docking is worthwhile in finding unknown drug targets or unexpected mode-of-action even though it costs high computation cost.