DOMINE is a comprehensive collection of known and predicted domain–domain interactions (DDIs) compiled from 15 different sources. The updated DOMINE includes 2285 new DDIs inferred from experimentally characterized high-resolution three-dimensional structures, and about 3500 novel predictions by five computational approaches published over the last 3 years. These additions bring the total number of unique DDIs in the updated version to 26219 among 5140 unique Pfam domains, a 23% increase compared to 20513 unique DDIs among 4346 unique domains in the previous version. The updated version now contains 6634 known DDIs, and features a new classification scheme to assign confidence levels to predicted DDIs. DOMINE will serve as a valuable resource to those studying protein and domain interactions: as a reference for bench scientists testing for new interactions, and as a source of DDIs for bioinformaticians seeking to predict novel protein–protein interactions. The contents of DOMINE are available at http://domine.utdallas.edu.
Protein domains are defined as structural or functional subunits that make up proteins. They can fold into a stable structure, evolve and function independently of the rest of the protein that contains them. Domains have evolved to combine into different arrangements to form multi-domain proteins with varying functions. Proteins seldom act alone. They almost always interact, either stably or transiently, with other proteins (as in a protein complex or a biological pathway) to perform housekeeping as well as critical cellular functions including cell signaling, trafficking and stress response.
Given that a majority of proteins are multi-domain proteins (1) and that an interaction between two proteins most often involves only a pair of constituent domains (one from each protein), understanding protein interactions at the domain level becomes critical to understanding not only the binding interfaces but also, most importantly, the causes of deleterious mutations at these interfaces. While the former can help discover unrecognized protein–protein interactions (2), the latter can help in the development of drugs to inhibit pathological interactions (3). For these simple reasons, understanding interaction between proteins at the domain level seems to be a logical step toward understanding interactions at the residue level.
Experimentally determined high-resolution three-dimensional (3D) structures are a prime resource for understanding how interactions between domains/proteins are mediated (4,5). However, the domain–domain interactions (DDIs) inferred from structures can explain only ~5% of protein–protein interactions (PPIs) in yeast and ~19% of PPIs in human (6). To expedite the discovery of previously unrecognized DDIs, computational approaches based on correlated sequence signatures and sequence co-evolution (7–9), gene fusion (10,11), phylogenetic profiling (12), gene ontology (11,13), statistical/probabilistic frameworks (11,14–17), the parsimony principle (18,19) and machine learning (20–22) have been proposed. While these approaches have contributed immensely to the discovery of novel DDIs, the ever-increasing sets of predictions make it difficult for bench scientists to access, analyze and integrate data sets scattered across a variety of formats. There was a need for an accessible online resource containing all available DDIs, known as well as predicted, under a single roof, so that scientists can spend their time dissecting these data sets for clues on structural and evolutionary aspects of protein and domain interactions.
DOMINE (23), a comprehensive collection of known and predicted DDIs from 10 different sources, was launched in 2007 as an online database server to serve as a reference to experimental biologists testing for new interactions and to provide a rich set of DDIs to bioinformaticians seeking to understand interaction interfaces and predict novel PPIs based on DDIs. Over the last year, the database has been updated to include DDIs predicted by five new computational approaches published over the past 3 years. Updates to the 3did database, which infers the set of known DDIs from high-resolution 3D structures, have added 2285 new known interactions to DOMINE, confirming 168 of the previously predicted DDIs. The updated version now contains 6634 known interactions and 21620 predicted interactions, and features a new classification scheme to assign confidence levels to predicted DDIs.
The DOMINE database contains DDIs gathered from 15 different sources listed in Table 1. The set of known DDIs, inferred from experimentally characterized high-resolution 3D structures, was obtained from iPfam (4) and 3did (5). Updates to 3did since the launch of DOMINE have added over 2000 new known interactions to DOMINE. DDIs predicted by 13 computational approaches (8,10–13,15–22), including over 2600 novel predictions from five new methods, namely GPE (19), DIPD (22), K-GIDDI (13), Insite (17) and DomainGA (20), were obtained from the respective publications. In cases where significance cutoff values had to be chosen to define the set of predicted DDIs, appropriate cutoffs were selected based on input from the authors. The DDIs from the 15 different sources add up to 26219 unique DDIs among 5140 unique Pfam domains in the updated version, a 23% increase compared to 20513 unique DDIs among 4346 unique Pfam domains in the previous version.
In addition to the 5706 new DDIs, the updated version of DOMINE features a new classification scheme replacing the old one, which we had used to classify predicted DDIs as high-confidence, medium-confidence or low-confidence predictions (HCP, MCP or LCP, respectively). In the inaugural version of DOMINE, we had simply classified a DDI as HCP if it was predicted using multiple sources of information or by at least two sufficiently different methods, MCP if the domains shared a GO term and LCP otherwise. In search of a better classification scheme, we first sought to characterize the predicted DDIs obtained from the various sources in an effort to assign a weight to each method. This would allow a confidence score to be computed for each predicted DDI by summing the weights of the methods predicting it; the score could then be used to classify DDIs into one of the three confidence classes.
Assigning weights to methods is not an easy task because it would require a fair and objective comparison of the methods' performances. The set of known DDIs obtained from iPfam and/or 3did has long been used as a gold standard set of positives. Nearly all of the computational approaches in Table 1 used this set of known DDIs to assess their performance/accuracy. Since a majority of these methods used different data sets and/or different types of data (proteomic, genomic, evolutionary, gene fusion, gene ontology, etc.) to make predictions, it is nearly impossible to perform a direct comparison of the performances of these approaches. Testing all the methods on a benchmark data set is not possible because some of the methods impose a unique set of constraints on the input data set: for example, RCDP (8) considers only those PPIs with both proteins having orthologous counterparts in 10 or more genomes. Typically, the percentage of predictions known to be true has been used as a metric to make indirect comparisons between methods. Assessing the performance of an approach solely on the set of known DDIs potentially pushes authors to benchmark their predictions or fine-tune their methods to maximize the percentage of predictions known to be true in an effort to demonstrate their method's superior performance. An incentive to predict what is already known sadly makes predicting novel DDIs less of a priority.
Pair-wise comparison of DDIs predicted by various methods revealed that, with the exception of DPEA and PE, there is little agreement even among methods such as DPEA, PE, DIPD, GPE and Insite, which used the exact same or a nearly identical data set for making predictions (Supplementary Table S1 and Supplementary Figure S1). The fact that 96.5% of DDIs predicted by DPEA were also predicted by PE could only mean one of the following three things: (a) DPEA and PE are so accurate that both are predicting essentially true DDIs, (b) the input data set used to predict DDIs is biased in some way, resulting in similar predictions regardless of the approach being used, or (c) the DPEA and PE methodologies are somewhat similar. Given that only about 12% of predictions by DPEA and PE are known to be true (23), explanation (a) is unlikely to be realistic. Since DIPD, on the exact same input data set, makes predictions that differ from those made by DPEA and PE (Supplementary Figure S1), (b) cannot be considered a good explanation. This leaves (c) as the only plausible explanation. A trivial scheme such as the one used previously to classify DDIs as HCP, MCP or LCP (i) can easily be fooled into classifying DDIs predicted by nearly identical methods as HCP and (ii) will fail to account for biases in the input data set used to make predictions. In the inaugural version of DOMINE, the former issue was addressed by treating the union of predictions by DPEA and PE (referred to as LP) as a single set of predictions. We knew at that time that this was rather arbitrary and subjective, and recognized the need to formulate a reasonable scheme for classification of predicted DDIs in the updated version of DOMINE.
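A statement such as "96.5% of DDIs predicted by DPEA were also predicted by PE" is directional: the fraction of one method's predictions recovered by another depends on which method is taken as the reference. The following is a minimal sketch of that calculation, not the procedure used to build the Supplementary material. The function name, the Pfam accessions and the two toy prediction sets are hypothetical, and each DDI is assumed to be an unordered pair of domains.

```python
def ddi(domain_a: str, domain_b: str) -> frozenset:
    """Represent a DDI as an unordered pair, so (a, b) and (b, a) are the same."""
    return frozenset((domain_a, domain_b))


def fraction_shared(reference: set, other: set) -> float:
    """Fraction of the DDIs in `reference` that also appear in `other`."""
    return len(reference & other) / len(reference) if reference else 0.0


# Hypothetical toy data: the small set is fully contained in the large one.
small = {ddi("PF00001", "PF00002"), ddi("PF00003", "PF00004")}
large = small | {ddi("PF00005", "PF00006"), ddi("PF00007", "PF00008")}

print(fraction_shared(small, large))  # every DDI in `small` is also in `large`
print(fraction_shared(large, small))  # only half of `large` is in `small`
```

Because the measure is asymmetric, a high value in one direction does not by itself show that two methods produce the same predictions.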
We decided to assign weights to methods based on how well their predictions are confirmed by others. For every pair of methods x and y, the Jaccard index (or Jaccard similarity coefficient), measuring how well the set of predictions P_x by x overlaps with the set P_y by y, was computed as
$$J(x, y) = \frac{|P_x \cap P_y|}{|P_x \cup P_y|}.$$
Pair-wise Jaccard index scores are depicted as a heat map in Figure 1. For every method x, the ‘prediction overlap index’ (POI) is defined as
$$\mathrm{POI}_x = \frac{1}{1 + \sum_{y \neq x} J(x, y)},$$
ranging from >0 to 1. For instance, a method whose predictions do not overlap with those of any of the other methods will receive a POI of one, whereas a method whose predictions overlap completely with those of at least one other method will receive a POI of not more than 0.5. The POI is not indicative of a method's performance, as it merely captures the degree to which the predictions made by a method overlap with those made by the other methods. The confidence score S for each predicted DDI is defined as the sum of the POIs of the methods predicting this DDI. The scoring scheme may seem counterintuitive: predictions by a method with a higher POI are less likely to have been predicted by many other methods and so tend to receive lower confidence scores, whereas predictions by a method with a lower POI are more likely to be shared with other methods and so tend to receive higher scores.
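The weighting can be illustrated with a short sketch. It assumes that the POI of a method x takes the form 1 / (1 + Σ J(x, y)) over all other methods y, which is consistent with the properties stated above (a POI of one with no overlap, and at most 0.5 with complete overlap). The method names, Pfam accessions and prediction sets are hypothetical toy data, not DOMINE contents.

```python
def ddi(domain_a: str, domain_b: str) -> frozenset:
    """Represent a DDI as an unordered pair of domain accessions."""
    return frozenset((domain_a, domain_b))


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prediction_overlap_index(predictions: dict) -> dict:
    """Assumed form: POI_x = 1 / (1 + sum of J(x, y) over all other methods y)."""
    poi = {}
    for x, px in predictions.items():
        total = sum(jaccard(px, py) for y, py in predictions.items() if y != x)
        poi[x] = 1.0 / (1.0 + total)
    return poi


def confidence_scores(predictions: dict, poi: dict) -> dict:
    """Score of a DDI = sum of the POIs of the methods that predict it."""
    scores = {}
    for method, ddis in predictions.items():
        for pair in ddis:
            scores[pair] = scores.get(pair, 0.0) + poi[method]
    return scores


# Hypothetical toy data: methods A and B make identical predictions,
# while method C predicts something no other method does.
predictions = {
    "A": {ddi("PF00001", "PF00002"), ddi("PF00003", "PF00004")},
    "B": {ddi("PF00001", "PF00002"), ddi("PF00003", "PF00004")},
    "C": {ddi("PF00005", "PF00006")},
}
poi = prediction_overlap_index(predictions)
scores = confidence_scores(predictions, poi)
print(poi)
print(scores)
```

In this toy case the two identical methods each receive a POI of 0.5, so a DDI predicted by both scores no higher than a DDI predicted only by the independent method. Under this assumed form, near-duplicate methods therefore cannot inflate a score simply by agreeing with each other.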
Based on the strategy described above for computing confidence scores for predicted DDIs, we have now redefined the confidence levels of predicted DDIs using the new scheme shown in Figure 2A. A DDI is classified as an HCP if its confidence score S is at least two, or at least one with the domains involved sharing a gene ontology (GO) term, or if it is predicted by the integrated ME approach (Table 1). A DDI that is not an HCP is an MCP if its score is at least one, or if the domains involved share a GO term. DDIs not classified as HCP or MCP are grouped as LCPs. Figure 2B shows the number of DDIs with a confidence score of S or above (black histogram; primary y-axis), and the fraction of them that are known to be true (green histogram; secondary y-axis). The latter shows that the higher the confidence score of a DDI, the more likely it is to be known to be true (R² = 0.98), lending credibility to the strategy used to compute the confidence scores. The stacked histogram in Figure 2C shows, for each method, the fraction of its predictions classified as HCP, MCP and LCP. DOMINE's contents are summarized in Figure 3.
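The three rules stated above can be written as a small decision function. This is a sketch of the rules as worded in the text rather than the code behind Figure 2A; the function and parameter names are hypothetical.

```python
def classify_ddi(score: float, shares_go_term: bool, predicted_by_me: bool) -> str:
    """Assign a confidence class to a predicted DDI.

    score           -- confidence score S (sum of POIs of the predicting methods)
    shares_go_term  -- True if the two domains share a GO term
    predicted_by_me -- True if the DDI is predicted by the integrated ME approach
    """
    # HCP: S >= 2, or S >= 1 with a shared GO term, or predicted by ME.
    if score >= 2 or (score >= 1 and shares_go_term) or predicted_by_me:
        return "HCP"
    # MCP: not an HCP, and either S >= 1 or a shared GO term.
    if score >= 1 or shares_go_term:
        return "MCP"
    # Everything else is a low-confidence prediction.
    return "LCP"


print(classify_ddi(1.2, True, False))   # S >= 1 with a shared GO term
print(classify_ddi(1.2, False, False))  # S >= 1 only
print(classify_ddi(0.4, False, False))  # neither condition met
```

The HCP test runs first, so the MCP branch only ever sees DDIs that have already failed it, matching the wording "a DDI that is not an HCP".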
The DOMINE database is freely available at http://domine.utdallas.edu. A user-friendly web interface was developed and tested on Linux and Windows environments using the Internet Explorer, Firefox and Safari web browsers. The database is stored using MySQL. The ‘Browse’ option can be used to view DDIs by Pfam domain name. Users may also browse domains based on GO classification. The ‘search’ option can be used to search for one or more domains using keywords (e.g. kinase), Pfam ID (e.g. HSP90) or accession (e.g. PF00061), Interpro ID (e.g. IPR004825) or GO term (e.g. transcription or GO:0006468). Clicking on a domain name (Pfam ID) takes the user to the results page displaying DDIs involving this domain (Figure 4). Each DDI is annotated with Interpro and GO IDs, its source of origin and whether or not it is known to be true. The entire database can be downloaded as a zip-compressed file, which includes a README file. Data within the files are tab- or ‘|’-separated.
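Since the downloadable files are tab- or ‘|’-separated, they can be read with a standard delimited-text parser. The sketch below uses Python's csv module. The column layout in the sample text (two Pfam accessions followed by a confidence class) is purely hypothetical; the README in the download should be consulted for the actual file names and columns.

```python
import csv
import io


def read_delimited(text: str, delimiter: str) -> list:
    """Split delimited text into rows of fields, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical sample records; the real column layout is given in the README.
pipe_text = "PF00001|PF00002|HCP\nPF00003|PF00004|LCP\n"
tab_text = "PF00001\tPF00002\tHCP\n"

pipe_rows = read_delimited(pipe_text, "|")
tab_rows = read_delimited(tab_text, "\t")
print(pipe_rows)
print(tab_rows)
```

Quoting is disabled here on the assumption that fields are plain identifiers; if a file turns out to contain quoted fields, the default quoting behaviour of csv.reader may be more appropriate.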
The DOMINE database is a comprehensive collection of known and predicted DDIs from 15 different sources. It also serves as a one-stop resource for domain-specific information, with links provided to popular databases including Pfam, Interpro and GO. Currently, DOMINE only supports DDIs based on Pfam domain definitions. In the future, we plan to support other popular domain definitions, including CDD and SCOP.
Supplementary Data are available at NAR Online.
This work was supported by the Intramural Research Program of the National Institute of Environmental Health Sciences, National Institutes of Health (Project number Z01ES102625-02 to R.J.).
Conflict of interest statement. None declared.
We thank Sarah Hunter from EBI for the latest Pfam-GO-Interpro mappings.