In this work, we compiled 13,822 small molecule-protein domain interactions (see Method), corresponding to 9,529 unique small molecules and 2,125 distinct protein domains. We originally identified 3,012 protein domains in total from these small-molecule binding proteins. Because some proteins contained multiple domains, the domains (30%) that had no bound small-molecule ligand (see Method) were excluded from the following study.
Small molecule–protein domain interactions: a many-to-many relationship
We observed that the number of small-molecule ligands varied from domain to domain, with a ligand count of five on average. The overall distribution is shown in . The majority of the protein domains bound few small-molecule ligands; however, some domains interacted with hundreds of distinct small molecules, such as the trypsin-like serine protease domain (CDD accession: cd00190), the carbonic anhydrase alpha I-II-III-XIII domain (CDD accession: cd03119) and the HIV retropepsin domain (CDD accession: cd05482) (). In addition, we found that, although the small-molecule ligands of many protein domains spread over a wide range of chemical space, they occupied preferential zones in terms of physicochemical properties, as indicated by molecular weight and octanol-water partition coefficient (Supplement figure S1-A). For example, the HIV retropepsin-like domain (CDD accession: cd05482) tended to bind larger molecules (Supplement figure S1-B), while the trypsin-like serine protease domain tended to bind relatively diverse ligands (Supplement figure S1-C).
Small-molecule ligand and protein domain associations
Top 20 protein domains binding multiple small-molecule ligands.
Conversely, we found that 1,168 of the 9,529 small molecules, including drugs, were promiscuous in that they bound to two or more protein domains. For example, dexibuprofen (PubChem CID: 39912), a non-steroidal anti-inflammatory drug (NSAID), bound to both the phospholipase A2 domain (PLA2c, CDD accession: cd00125) and the albumin domain (CDD accession: cd00015). The overall distribution of the number of protein domains targeted by small molecules is shown in . It is worth noting that 73% (852) of the promiscuous small molecules bound multiple domains from different domain superfamilies. For instance, nicotinamide adenine dinucleotide phosphate (NADP, PubChem CID: 5886) bound to 103 distinct protein domains from over 20 domain superfamilies, and adenosine diphosphate (ADP, PubChem CID: 6022) interacted with as many as 191 protein domains belonging to 57 superfamilies that are widely distributed in biological systems. Notably, 72% (842) of the 1,168 promiscuous molecules were cognate (endogenous) molecules. These results demonstrate the versatility of small molecules, including cognate molecules and drugs, in regulating biological processes. Our analysis thus unveiled a many-to-many relationship between small molecules and protein domains, which led us to further investigate the relationships among protein domains that arise from interactions with small-molecule ligands.
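The two counts behind this many-to-many picture (ligands per domain and domains per ligand) can be illustrated with a minimal Python sketch. It assumes the interactions are available as (ligand, domain) pairs. The function name and the `ligand_x` placeholder are hypothetical; only the two dexibuprofen pairs are taken from the text. This is an illustration under those assumptions, not the pipeline used in this work.

```python
from collections import defaultdict


def summarize_interactions(pairs):
    """Index (ligand, domain) pairs in both directions (hypothetical helper)."""
    ligands_by_domain = defaultdict(set)
    domains_by_ligand = defaultdict(set)
    for ligand, domain in pairs:
        ligands_by_domain[domain].add(ligand)
        domains_by_ligand[ligand].add(domain)
    # A ligand is called promiscuous when it binds two or more domains.
    promiscuous = {
        ligand for ligand, domains in domains_by_ligand.items()
        if len(domains) >= 2
    }
    return ligands_by_domain, domains_by_ligand, promiscuous


# Toy input: dexibuprofen (CID 39912) binding PLA2c and albumin as stated in
# the text, plus one made-up ligand ("ligand_x") for illustration only.
toy_pairs = [
    ("CID39912", "cd00125"),
    ("CID39912", "cd00015"),
    ("ligand_x", "cd00125"),
]
ligands_by_domain, domains_by_ligand, promiscuous = summarize_interactions(toy_pairs)
```

In this toy input only the dexibuprofen entry is flagged as promiscuous, since the placeholder ligand binds a single domain.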
Pairwise protein domain associations
Building on the observation in the previous section, we noted that about 89% (1,883) of the 2,125 domains were associated with at least one other domain through binding common ligands, producing 79,160 domain pair associations. The remaining 11% (242 domains) bound only “selective” ligands, i.e. ligands that interacted with a single domain target in the current dataset; hence these domains showed no domain associations based on shared ligands. Surprisingly, 86% (67,976) of the domain pair associations were between domains from different superfamilies. This clearly indicates that distinct protein domains may be associated with each other in terms of small-molecule binding, despite differences in protein sequence or structure.
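As a hedged illustration of how domain pairs follow from shared ligands, the sketch below enumerates every pair of domains that bind a common ligand and reports the domains left without any partner. All identifiers (`lig_1`, `dom_a`, and so on) and the function name are hypothetical placeholders; the actual procedure is described in Method.

```python
from collections import defaultdict
from itertools import combinations


def domain_pairs_by_shared_ligand(pairs):
    """Map each domain pair to the set of ligands both domains bind."""
    domains_by_ligand = defaultdict(set)
    for ligand, domain in pairs:
        domains_by_ligand[ligand].add(domain)
    shared = defaultdict(set)
    for ligand, domains in domains_by_ligand.items():
        # Sorting gives each unordered pair one canonical key.
        for d1, d2 in combinations(sorted(domains), 2):
            shared[(d1, d2)].add(ligand)
    return shared


# Made-up toy data: lig_3 is "selective" for dom_d.
toy = [
    ("lig_1", "dom_a"), ("lig_1", "dom_b"),
    ("lig_2", "dom_a"), ("lig_2", "dom_b"), ("lig_2", "dom_c"),
    ("lig_3", "dom_d"),
]
shared = domain_pairs_by_shared_ligand(toy)
associated = {domain for pair in shared for domain in pair}
unassociated = {domain for _, domain in toy} - associated
```

Here three domain pairs are produced, and `dom_d` remains unassociated because its only ligand binds no other domain.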
Furthermore, we investigated the strength of these domain associations. Intuitively, the more ligands two domains share, the stronger the association. In this study, we considered not only the number of common ligands but also similar ligands, as we noticed that certain ligands shared significant structural similarity, such as ADP and adenosine triphosphate (ATP, PubChem CID: 5957). We set a similarity (Tanimoto coefficient) threshold of 0.90 to ensure that only high-quality domain associations were identified. Incorporating ligand similarity increased the number of domain associations identified by 6%.
For any two domains, their ligand structures were compared pairwise. The number of similar ligand pairs, termed the NSLP score, was calculated to represent the strength of a domain association. By systematically evaluating the NSLP score for each domain pair, we found great variation in domain association strength (). Some domain pairs from the same superfamily tended to have high NSLP scores. For example, the bacterial photosynthetic reaction center complex M domain (CDD accession: cd09291) and the bacterial photosynthetic reaction center complex L domain (CDD accession: cd09290), both of which belong to the photosynthetic reaction center superfamily (CDD accession: cl08220), had an NSLP score of 926. Notably, certain domain pairs from different superfamilies also had high NSLP scores, indicating considerable similarity among their ligands. For instance, the nucleoside diphosphate kinase group I domain (CDD accession: cd04413) and the canonical ribonuclease A domain (CDD accession: cd06265), although they belong to the nucleoside diphosphate kinase superfamily (CDD accession: cl00335) and the ribonuclease A superfamily (CDD accession: cl00128), respectively, had an NSLP score of 151, with many of the similar ligands being nucleotide derivatives. More examples of protein domain associations with high NSLP scores are listed in .
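The NSLP score described above can be sketched as follows. This is a minimal illustration under stated assumptions: ligand fingerprints are represented as sets of on-bit indices (the fingerprint type is not specified in this section), an identical ligand bound by both domains counts as one similar pair, and all ligand names and bit sets are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def nslp_score(ligands_1, ligands_2, fingerprints, threshold=0.90):
    """Count ligand pairs (one per domain) that are identical or similar."""
    count = 0
    for a in ligands_1:
        for b in ligands_2:
            # Assumption: a ligand shared by both domains counts as a pair.
            if a == b or tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
                count += 1
    return count


# Hypothetical fingerprints: lig_a and lig_b overlap in 19 of 20 bits
# (Tanimoto 0.95); lig_c shares no bits with either.
fingerprints = {
    "lig_a": set(range(20)),
    "lig_b": set(range(19)),
    "lig_c": {100, 101},
}
score = nslp_score({"lig_a", "lig_c"}, {"lig_b", "lig_c"}, fingerprints)
```

With these toy inputs the score is 2: one pair of similar ligands (`lig_a`, `lig_b`) and one ligand common to both domains (`lig_c`).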
Distribution of the NSLP (number of similar ligand pair) scores between protein domain pairs
A selected list of protein domain pairs binding to same/similar small molecules.
In fact, we found that the majority of the domain associations identified in the present study spanned different superfamilies. Hence, we further investigated domain superfamily associations and their strength in the same way as for the domain associations. As a result, a number of closely related superfamilies were identified; for example, the P-loop NTPase superfamily (CDD accession: cl09099) and the Rossmann-fold NAD(P)(+)-binding protein superfamily (CDD accession: cl09931) were associated with an NSLP score of 625. Additional examples of superfamilies with significantly strong associations with regard to small-molecule binding are listed in Supplement table S1. This analysis demonstrates, to some extent, a limitation of conventional classifications based on protein sequence or structure: they cannot fully represent the relationships that arise from small-molecule binding. It therefore indicates that identifying protein domain associations based on small-molecule binding may complement conventional approaches in protein family studies.
Protein domain network
In the preceding analysis of pairwise domain associations, we not only identified closely related domains with regard to small-molecule binding, but also found some popular domains that were associated with many other domains through binding common or similar ligands. To characterize the global relationship among these protein domains, we built a domain network (see Method) consisting of 2,125 nodes (domains) and 181,145 edges (domain associations) in total. About 95% (2,009) of these nodes were connected to at least one neighboring node and are termed ‘connected’, while the remaining 5% (116) were singletons with no edge linking them to other nodes and are termed ‘isolated’. Notably, most ‘connected’ nodes (1,992) were in the giant component, the largest connected component of the network. These results suggest that small-molecule binding domains are extensively associated with each other through binding small-molecule ligands.
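The distinction between ‘connected’ nodes, ‘isolated’ nodes and the giant component can be illustrated with a short breadth-first search. The sketch below is a generic connected-components routine on a made-up six-node graph, assuming an undirected network given as node and edge lists; it is not the software used to build the network in this work.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical toy network: F has no edges and is therefore isolated.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(nodes, edges)
giant = max(components, key=len)
isolated = [c for c in components if len(c) == 1]
```

In the toy graph the giant component is {A, B, C}, and F is the only isolated node.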
Across the entire domain network, we observed a power-law-like distribution of node degrees (), which indicates that nodes with higher degree (“hub” nodes) occurred at lower frequency in general. For example, the canonical ribonuclease A domain (CDD accession: cd06265) and the nucleoside diphosphate kinase group I domain (CDD accession: cd04413) were connected to as many as 690 and 676 other domains, respectively (Supplement table S2 and S3). Moreover, the shortest path between any two nodes (domains) in the network was 2.9 on average, i.e. any two randomly selected domains were separated by fewer than three steps on average, which suggests a small-world property of the network [32].
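For illustration, the degree distribution and the average shortest path length can be computed as in the following sketch. It assumes an undirected, unweighted network stored as an adjacency dictionary and averages only over pairs of nodes that can reach each other; how unreachable pairs were treated in this work is not stated here, so that choice is an assumption. The four-node path graph is made up.

```python
from collections import Counter, deque


def bfs_distances(adjacency, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


def average_shortest_path(adjacency):
    """Mean shortest path length over all reachable ordered node pairs."""
    total = 0
    pair_count = 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pair_count += 1
    return total / pair_count if pair_count else 0.0


# Hypothetical path graph A - B - C - D.
adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
# Number of nodes observed at each degree value.
degree_counts = Counter(len(neighbors) for neighbors in adjacency.values())
mean_path = average_shortest_path(adjacency)
```

Plotting `degree_counts` on log-log axes is the usual way to inspect whether a degree distribution looks power-law-like.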
Degree distribution of the protein domain network
Furthermore, we calculated the clustering coefficient [32] of each node and obtained an average value of 0.5 over the network, which implies potential modularity in the domain network. A domain module represents a group of domain nodes that are densely inter-connected within the group but loosely connected to nodes outside it. On inspecting these domain modules, it was not surprising to observe that domains in such modules often shared a similar biochemical mechanism in vivo or belonged to the same superfamily. For example, the alpha carbonic anhydrase (CA) domains, including types I-II-III-XIII (CDD accession: cd03119), V (CDD accession: cd03118), IX (CDD accession: cd03150), XII–XIV (CDD accession: cd03126) and VII (CDD accession: cd03149), which catalyze the hydration of CO2 to bicarbonate and protons in living organisms, formed a fully inter-connected module in the network (the red module in , referred to as the CA module in this work) through binding acetazolamide, the first non-mercurial diuretic drug [36].
Three selected protein domain modules
In addition, we found that some domains within a module were involved in related biological processes. One such example was the blue module in , which consisted of six protein domains: the PLA2c domain (CDD accession: cd00125), the prostaglandin endoperoxide synthase domain (PES, CDD accession: cd09816), the lipocalin domain (CDD accession: pfam00061), the albumin domain (CDD accession: cd00015), the ligand binding domain of peroxisome proliferator-activated receptors (NR-LBD-PPAR, CDD accession: cd06932) and the ligand binding domain of hepatocyte nuclear factor 4 (NR-LBD-HNF4-like, CDD accession: cd06931). These domains were closely inter-connected in the network because they bound various fatty acids or their derivatives. In particular, the PLA2c, PES, lipocalin and albumin domains had relatively stronger associations (higher NSLP scores) with each other; the first two are closely related to prostaglandin biosynthesis in the arachidonic acid metabolism pathway and are considered main targets of NSAIDs, while the latter two are responsible for transporting lipids, fatty acids and their metabolites in vivo. More interestingly, the NR-LBD-HNF4-like domain, which was recently ‘deorphanized’ because it could be regulated by fatty acids [39], was also identified in this module. This result suggests that domains involved in related biological processes or pathways can be identified through the domain network analysis.
Conversely, some domains involved in different pathways and superfamilies were also observed to form modules through binding common cognate molecules. For instance, the ligand binding domain of thyroid hormone receptors (NR-LBD-TR, CDD accession: cd06935), the TLP-Transthyretin domain (CDD accession: cd05821) and the ligand binding domain of androgen receptors (NR-LBD-AR, CDD accession: cd07073) formed a three-node domain module (the green module in ) because they bound the thyroid hormones thyroxine (PubChem CID: 5819) and triiodothyronine (PubChem CID: 5920) and a derivative, triac (PubChem CID: 5803). Despite belonging to different superfamilies, the first two domains are known to participate in thyroid hormone transport and signaling, while the NR-LBD-AR domain was recently reported to bind thyroid hormones [40]. In fact, some modules consisting of hundreds of domains, such as the NADP- or ATP-binding domains, were also observed. Thus, proteins containing these highly associated domains can be effectively regulated by a few common molecules in vivo.
Notably, domain modules were often inter-connected to some extent, as were the three modules shown in . Even within the fatty acid related module (colored in blue), we can clearly identify a sub-module consisting of the PLA2c, PES, albumin and lipocalin domains, which were inter-connected with strong associations. Indeed, these four domains were also observed in larger modules, including the ATP related module and the NADP related module. To characterize how the domains or domain modules were organized over the entire network, we investigated the relationship between clustering coefficient and node degree. For a given node, the higher the clustering coefficient, the more likely its neighbors are inter-connected. We found that the clustering coefficients were inversely proportional to the node degrees in general (Supplement figure S2), suggesting that nodes within a module tend to have higher clustering coefficients, whereas nodes with relatively lower clustering coefficients but higher degrees are responsible for integrating domain modules. Similar phenomena were also observed in other hierarchically organized networks [41].
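The local clustering coefficient referred to above is conventionally defined, for a node with k neighbors, as the number of edges among those neighbors divided by k(k-1)/2. The sketch below implements that standard definition, assigning 0 to nodes with fewer than two neighbors (an assumption, since this section does not state the convention used). The toy graph, a triangle attached to a hub with two leaf nodes, is made up.

```python
from itertools import combinations


def local_clustering(adjacency):
    """Local clustering coefficient of every node in an undirected graph."""
    coefficients = {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            # Assumption: nodes with fewer than two neighbors get 0.
            coefficients[node] = 0.0
            continue
        # Count edges that directly link two neighbors of this node.
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
        coefficients[node] = 2.0 * links / (k * (k - 1))
    return coefficients


def build_adjacency(edges):
    """Build an undirected adjacency dictionary from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


# Hypothetical graph: triangle A-B-C, plus hub H linked to A, D and E.
adjacency = build_adjacency(
    [("A", "B"), ("B", "C"), ("A", "C"), ("H", "A"), ("H", "D"), ("H", "E")]
)
clustering = local_clustering(adjacency)
```

In this toy graph the triangle members B and C have a coefficient of 1, whereas the hub H, whose neighbors are not linked to each other, has a coefficient of 0, mirroring the contrast between module members and integrating nodes described above.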
In summary, these results indicate that small-molecule binding domains that share the same biochemical mechanism (or belong to one superfamily), are involved in related biological pathways, or bind common cofactors can be identified in the network as domain modules. The results reveal new relationships among protein domains that may be hard to detect with conventional approaches based on protein sequence or structure.
Protein domain associations for drug target identification
It is widely accepted that many marketed drugs are derived from natural products or known drugs [44]. It is therefore of great interest to study whether the domain associations identified in this work can be used to infer potential drug targets for drug repurposing. In the small molecule-domain interaction dataset, we found a total of 252 drug-domain pairs, corresponding to 147 marketed drugs and 135 protein domains (Supplement table S4). A domain network showing interactions between drugs and their protein domain targets was built, and a sub-network including the three domain modules discovered in the previous section is shown in .
Drug target identification based on protein domain and drug interaction network
Based on this network, we successfully identified potential targets for some known drugs, which were retrospectively verified by literature search (shown in ). For example, in the fatty acid related module (colored in blue), we observed that three NSAIDs, i.e. dexibuprofen, indomethacin (PubChem CID: 3715) and diclofenac (PubChem CID: 3033), each interacted with several domains (solid grey lines in ), including the PLA2c domain and PES domain. Considering the strong associations among domains in this module, one may be interested in repositioning these drugs to other domain members. Some of the predicted drug-domain associations were confirmed by literature mining (dashed green lines in ). For instance, diclofenac was reported to bind to NR-LBD-PPAR [47], albumin [48] and lipocalin [49], and indomethacin was found to bind to albumin as well [50]. In particular, it has been reported that proteins containing the NR-LBD-PPAR domain, such as peroxisome proliferator-activated receptor gamma, can be activated by many NSAIDs, including ibuprofen (PubChem CID: 3672) and flufenamic acid (PubChem CID: 3371), that produce adipogenesis and peroxisome activity in vivo. Thus, we may anticipate that more hidden interactions with NSAIDs will be discovered by conducting a systematic assay against all protein domains in this module. Likewise, ethoxzolamide (PubChem CID: 3295) could be successfully repositioned as a ligand for other member domains in the CA module (dashed green line in ), although it bound to only two domains according to the current dataset (solid grey line in ). In fact, this CA inhibitor can inhibit almost all CA isoforms in many tissues and organs, producing various inhibitory profiles and clinical applications [36].
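The within-module inference described above amounts to proposing, for each drug with at least one known target in a module, the remaining member domains as candidate targets. A minimal sketch under that assumption follows; the function name and the `drug_x`, `dom_a` style identifiers are hypothetical, and the output is a list of hypotheses to be checked against literature or assays, not confirmed interactions.

```python
def repositioning_candidates(module_domains, known_targets):
    """Suggest untested module domains for drugs already bound in the module."""
    candidates = {}
    for drug, targets in known_targets.items():
        # Only drugs with at least one known target inside the module qualify.
        if targets & module_domains:
            missing = module_domains - targets
            if missing:
                candidates[drug] = missing
    return candidates


# Made-up module and drug-domain data for illustration.
module = {"dom_a", "dom_b", "dom_c"}
known = {
    "drug_x": {"dom_a", "dom_b"},  # binds two module members
    "drug_y": {"dom_z"},           # no target inside the module
}
candidates = repositioning_candidates(module, known)
```

Here only `drug_x` receives a candidate (`dom_c`), analogous to extending a drug known to bind two members of a module to the remaining member domains.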
Moreover, we could infer potential domain targets from neighboring modules. For instance, the TLP-Transthyretin domain (colored in green in ), which is responsible for transporting thyroid hormones and retinol in vertebrates, was connected to several domains in the fatty acid module (colored in blue in ), although these associations were relatively weak compared with those within the modules. Several drugs, including levothyroxine and diflunisal, were found to bind to both the fatty acid module (colored in blue in ) and the thyroid hormone related module (colored in green in ) based on the current network; hence it would be interesting to explore whether other drugs can also bind to domains across these two modules. From the literature, we found that flufenamic acid, a ligand of the PES domain from the fatty acid module [52], was able to bind to the NR-LBD-PPAR domain of the same module [51], as well as to the other two domains, the NR-LBD-AR domain and TLP-Transthyretin domain, in the thyroid hormone related module (dashed green line in ). In addition, a plant-derived naphthoquinone, shikonin (PubChem CID: 479503), which showed no interaction with either the fatty acid module or the thyroid hormone related module in the current dataset, was reported to bind to both receptors containing the NR-LBD-TR domain (PubChem AID: 1479) and receptors containing the PES domain, including cyclooxygenase-1 and -2 (COX1 and COX2) [53] (dashed green line in ). Furthermore, it has also been reported that NSAIDs indeed compete with thyroid hormone binding in vivo. Similarly, based on the observed connection between two neighboring modules (the CA module and the fatty acid module) due to celecoxib (PubChem CID: 2662), a selective COX2 inhibitor with nanomolar activity against carbonic anhydrase [56], we successfully verified a hidden interaction of the alpha-CA-I-II-III-XIII domain with indomethacin, a ligand of the PES domain [57].
Our analysis indicates that additional drug targets may be suggested based on the modules from the domain interaction network. Thus, it demonstrates again that the constructed domain network can be used in drug target identification for drug repurposing.