Despite the availability of a large number of protein–protein interactions (PPIs) in several species, researchers are often limited to using very small subsets in a few organisms due to the high prevalence of spurious interactions. In spite of the importance of quality assessment of experimentally determined PPIs, a surprisingly small number of databases provide interactions with scores and confidence levels. We introduce HitPredict (http://hintdb.hgc.jp/htp/), a database with quality assessed PPIs in nine species. HitPredict assigns a confidence level to interactions based on a reliability score that is computed using evidence from sequence, structure and functional annotations of the interacting proteins. HitPredict was first released in 2005 and is updated annually. The current release contains 36930 proteins with 176983 non-redundant, physical interactions, of which 116198 (66%) are predicted to be of high confidence.
Protein–protein interactions (PPIs) are vital for cellular function in organisms and hence their detection is of considerable importance. The advent of high-throughput technologies has led to a manifold increase in the PPI information in several model organisms through large-scale yeast two-hybrid (Y2H) and tandem affinity purification in combination with mass spectrometry (TAP/MS) experiments. However, these data have two major drawbacks that limit their use: (i) the large number of spurious interactions detected (1) and (ii) the absence of direct binary interaction information in protein co-complex data obtained from TAP/MS experiments (2). As a result, most studies using PPI information use either data obtained exclusively from small-scale experiments or interactions confirmed in multiple experiments. Both types of interaction subsets are considered high confidence but constitute only a fraction of the amount of data available (3), and their use can often lead to biased results. An alternative approach is to utilize the high confidence subsets provided by authors of high-throughput experiments. However, these interaction subsets are assessed using a range of techniques with differing accuracies, making comparisons among data sets difficult. Frequently, such high confidence interaction subsets are available only for one or two species, typically yeast and human. As a result, a large amount of the PPI information in several species, though correct and potentially useful, is often ignored.
The major reason for this lack of information usage is the scarcity of comprehensive PPI databases that provide confidence scores assessing the quality of the interactions. Of the many PPI databases that are currently in use [IntAct (4), BioGRID (5), BIND (6), MINT (7), DIP (8), STRING (9), MPPI (10), HPRD (11), MPACT (12) and consolidated databases like iRefWeb (13) and APHID (14)], only two provide confidence scores, namely STRING and MINT. The score in MINT relies on the number and types of experiments in which the interaction is detected without adequately utilizing the genomic annotations of the interacting proteins. STRING uses genomic association information along with homology, annotation and experiment information, but does not consider information regarding interacting domains. Furthermore, in spite of the development of a number of methods to assess interaction quality, there is no consensus on the best method and few are actually applied to multiple large interaction data sets in more than one species, or make the high confidence data sets easily accessible (15–20).
To address these issues, we introduce HitPredict (http://hintdb.hgc.jp/htp/), a database of quality assessed interactions in nine species. HitPredict combines interactions from IntAct, BioGRID and HPRD and determines the confidence level of the interactions based on a reliability score calculated using the sequence, structure and functional annotations of the interacting proteins (21). HitPredict was first introduced in 2005 as a database of high confidence PPIs from high-throughput data sets. It has since been updated annually and has now been expanded to include small-scale interactions along with a more intuitive user interface.
HitPredict contains 176983 non-redundant, physical PPIs among 36930 proteins, collated from IntAct, BioGRID and the HPRD. We selected these three databases because they have high data coverage and comprehensive annotations. Genetic interactions and those among proteins with obsolete identifiers in UniProt (22) were excluded. Annotations and links to external databases are extensively provided. Coexpression correlation coefficients of interacting proteins obtained from COXPRESdb (23) and ATTED-II (24) are also assigned for mammals and plants, respectively.
Interactions in HitPredict are differentiated into two types, small-scale and high-throughput, depending on the nature of the experiment in which they were identified. The distinction between small-scale and high-throughput experiments is ambiguous but critical, primarily because interactions from small-scale experiments are typically considered to be of high confidence. For the purposes of HitPredict, experiments with <100 interactions are considered to be small-scale and high confidence, while the rest are denoted as high-throughput. This cutoff value is based on the observation that ~90% of the interactions in experiments with <100 interactions are supported by multiple lines of evidence (see Supplementary Data for details). The interactions are further categorized into directly observed binary interactions and those derived from protein co-complex data using the spoke model (i.e. the bait interacts with each of the prey proteins). Figure 1 shows the distribution of interactions in HitPredict by source, type and species. The large number of high-throughput interactions emphasizes the need for quality assessment.
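The two categorization rules above can be illustrated with a minimal Python sketch. Only the cutoff of 100 interactions and the definition of the spoke model come from the text; the function names and the protein labels are hypothetical and do not describe HitPredict's actual implementation.

```python
SMALL_SCALE_CUTOFF = 100  # from the text: experiments with <100 interactions are small-scale


def classify_experiment(n_interactions, cutoff=SMALL_SCALE_CUTOFF):
    """Label an experiment by the number of interactions it reports."""
    return "small-scale" if n_interactions < cutoff else "high-throughput"


def spoke_expand(bait, preys):
    """Spoke model: pair the bait with each prey protein.

    Prey proteins are not paired with each other, and a bait that
    reappears among its own preys is skipped.
    """
    return [(bait, prey) for prey in preys if prey != bait]


# Hypothetical co-complex purification: one bait pulled down with three preys.
pairs = spoke_expand("BAIT", ["PREY1", "PREY2", "PREY3"])
print(pairs)
print(classify_experiment(len(pairs)))
```

Under the spoke model a purification with n prey proteins yields n binary interactions, whereas pairing every member of the complex with every other would yield n(n+1)/2.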
All high-throughput interactions and small-scale interactions derived from co-complex data are assessed for their reliability. Interactions from small-scale binary experiments are considered to be high confidence without the assignment of a score (see Supplementary Data for benchmark). As described in detail in our previous report (21), HitPredict calculates the reliability of interactions in the form of a likelihood ratio using naïve Bayesian networks to combine evidence from the presence of the following features: (i) Pfam domains in the interacting proteins that are known to interact in 3D structures, (ii) common GO terms and (iii) homologous interactions.
An evaluation of the quality of prediction of the features shows that interacting Pfam domains are the most accurate feature, followed by common GO terms and then homologous interactions. The combined likelihood ratio from these features is an estimate of the posterior odds of an interaction, with one or more features, being true. A likelihood ratio greater than 1 indicates that the interaction is supported by one or more of the features and thus has a greater probability of being true. This method has good specificity and sensitivity in the confidence predictions made (21). Additionally, this scoring scheme differs from that in STRING or MINT since it does not depend on the number of experiments supporting an interaction or the number of interactions determined in a data set, and uses domain–domain interaction information with other genomic features. This makes HitPredict especially useful in identifying high confidence subsets in interactions detected in a single high-throughput experiment. Thus, it potentially provides an alternative perspective on the quality of the interaction data. This is confirmed by comparison of the total and high confidence interactions in Saccharomyces cerevisiae in HitPredict, STRING and MINT (see Supplementary Data).
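As a rough illustration of how a naïve Bayesian combination of evidence yields a single likelihood ratio, the sketch below multiplies per-feature likelihood ratios for the features that are present and applies the threshold of 1 described above. The numerical values are invented for illustration (only their ordering follows the text), and treating an absent feature as a neutral factor of 1 is an assumption; the actual procedure is described in the previous report (21).

```python
import math

# Hypothetical per-feature likelihood ratios. The values are made up;
# only their ordering (Pfam domains > GO terms > homologous interactions)
# follows the accuracy ranking reported in the text.
FEATURE_LR = {
    "interacting_pfam_domains": 20.0,
    "common_go_terms": 5.0,
    "homologous_interaction": 3.0,
}


def combined_likelihood_ratio(features_present, feature_lr=FEATURE_LR):
    """Naive Bayes combination: multiply the ratios of the features present.

    Assumption: a feature that is absent contributes a neutral factor of 1,
    so an interaction with no supporting feature has a ratio of exactly 1.
    """
    return math.prod(feature_lr[name] for name in features_present)


def confidence_level(likelihood_ratio, threshold=1.0):
    """A ratio greater than 1 is treated as high confidence."""
    return "high" if likelihood_ratio > threshold else "low"


lr = combined_likelihood_ratio(["interacting_pfam_domains", "common_go_terms"])
print(lr, confidence_level(lr))
print(confidence_level(combined_likelihood_ratio([])))
```

Multiplication is valid only under the naïve Bayes assumption that the features are conditionally independent; correlated features would overstate the combined ratio.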
Of the 176983 PPIs in HitPredict, 116198 (66%) are predicted to be of high confidence. The breakdown of predicted error rates in PPIs obtained from different data sources and in different species is shown in Figure 2 and Supplementary Table S1. The presence of a large number of low confidence interactions in several data sets highlights the need for databases like HitPredict. Supplementary Figure S1 gives the percentage of high confidence interactions in HitPredict for 23 high-throughput experiments with more than 1000 interactions, published from 2000 to 2009, and shows the large number of predicted false positives in many of them. The large number of high confidence interactions predicted in high quality data sets like Collins et al. (28) illustrates the good performance of HitPredict.
HitPredict can be used for three main purposes.
Interactions for proteins can be searched for using a number of protein identifiers like UniProt ID, Entrez Gene ID, RefSeq Protein ID, the protein name or a description keyword. Selecting the protein from the results displays the interactions of the protein as a graphical network and a table (Figure 3A). The graph shows the interaction network of the query protein and its interacting partners. The color and style of the link indicate the quality of the interaction and the type of experiment in which it was detected. The table of interaction partners contains details of the confidence assigned to each interaction, the score in the form of the likelihood ratio, and the supporting evidence used to determine the score (Figure 3A). Details of individual interactions and the evidence supporting them can be seen by selecting the interaction of interest. This leads to a page giving details and annotations for the interaction, such as the source database and publications, the co-expression correlation coefficient of the genes and the protein annotations (Figure 3B). The evidence details shown include the Pfam domains in the interacting proteins which are known to interact in 3D structures, the common GO terms and a graphical display of the homologous interactions showing the species, the score, e-value and percent identity of the homologous proteins.
For example, to find the high confidence interactions of the human protein ‘HIF1A’, the hypoxia-inducible factor 1-α, one can search for the term ‘hif1’, which returns proteins in several species. Selecting the protein ‘hif1a_human’ from the search results leads to a page showing 58 interactions for hif1a_human, of which 52 are of high confidence. The interactions obtained from HitPredict can be compared to those in STRING (66 interactions, of which 15 are high confidence interactions with a score >0.7) and MINT (26 interactions, of which 16 may be considered high confidence with a score >0.4). In this case, the high confidence data set provided by HitPredict contains interactions over and above those from MINT and STRING. However, this may not always be the case since the number of interactions and the scoring scheme vary among databases (Supplementary Data). Thus, referring to multiple databases with distinct scoring schemes is a prudent approach.
Interactions from individual experiments in HitPredict, specifically high-throughput ones, can be searched for directly using the Pubmed ID. A list of these is provided in the Help section for user reference. The resulting interactions are displayed in a tabular form (Figure 3A), and interaction and evidence details can be viewed as described in the previous section (Figure 3B). This feature is currently not available in STRING.
High confidence interactions from small-scale experiments can be downloaded and used either for network analysis or as gold standard data sets. Other predicted high confidence interactions can be downloaded and used to analyze large interaction networks or specific sub-networks in combination with additional data such as for transcription factors or disease genes. High confidence interaction data downloaded from HitPredict is easy to use since it is categorized by species and type of the interactions and includes the interaction details, the score and evidence used to compute the score. The use of UniProt identifiers makes it convenient to map the interacting proteins onto other protein identifiers, and annotate them. This makes it easier to use than the data downloaded from MINT, which does not include the confidence scores for all interactions, or STRING, which does not include the Pubmed IDs and uses different types of protein identifiers in different species.
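A downloaded interaction table could be turned into a network for analysis along the following lines. This is only a sketch: the tab-delimited layout, the column names (`uniprot_a`, `uniprot_b`, `likelihood_ratio`, `confidence`) and the protein labels are hypothetical placeholders, not the actual format of the HitPredict download files, which should be checked against the database's own documentation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-delimited sample standing in for a downloaded file.
SAMPLE = (
    "uniprot_a\tuniprot_b\tlikelihood_ratio\tconfidence\n"
    "PROT_A\tPROT_B\t12.5\tHigh\n"
    "PROT_A\tPROT_C\t0.4\tLow\n"
    "PROT_B\tPROT_D\t3.1\tHigh\n"
)


def load_high_confidence(handle):
    """Build an undirected adjacency map from rows labelled high confidence."""
    neighbours = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["confidence"].lower() == "high":
            a, b = row["uniprot_a"], row["uniprot_b"]
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


network = load_high_confidence(io.StringIO(SAMPLE))
# Degree (number of high confidence partners) of each protein in the network.
degree = {protein: len(partners) for protein, partners in network.items()}
print(degree)
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle; the adjacency map can then be passed to any network analysis routine.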
Thus, HitPredict can be used to confirm the interactions of a small set of proteins or to perform large-scale feature analyses of high confidence interaction networks. It may either be used independently or in combination with other databases that provide confidence scores.
The use of multiple genomic features to calculate a reliability score, the availability of high confidence interaction subsets in several species, and the ease of obtaining these scored interaction subsets, are some of the advantages of HitPredict. It provides an additional means of identifying high confidence PPIs using an alternative scoring strategy. The use of a common scoring scheme for interactions from different experiments allows the comparison of multiple data sets and sources.
Specifically, compared with MINT, HitPredict provides a larger interaction set, quality assessment scores for all high-throughput interactions and use of multiple genomic features for score calculation. In comparison to STRING, HitPredict provides the ability to search interactions from an experiment using Pubmed ID, uniform annotations using UniProt identifiers for easier mapping across databases, and categorized interaction files facilitating data download and large-scale analyses. Additionally, it uses information from structurally known interacting Pfam domains in the quality assessment. The presence of non-specific interactions through crystal contacts in 3DID, and the possibility that in some cases the proteins with these domains may interact differently than previously observed, seem to have a minimal effect on the performance of this feature. Indeed, this feature has the highest reliability in predicting high quality interactions, indicating the minimal effect of non-specific domain interactions and confirming the previous finding that homologs of interacting protein pairs interact in a similar manner (29).
Future enhancements include incorporation of data from additional PPI databases, further annotations for proteins and interactions, and improvement of the confidence score by including experiment number and type information. User interface enhancements will enable users to view larger interaction networks in graphical format, and display homologs of proteins with links to their interactions. HitPredict updates are currently performed once a year. HitPredict has been continually maintained, updated and enhanced in the last 5 years in order to make it a comprehensive and easily accessible source of quality-assessed PPIs in multiple species.
Supplementary Data are available at NAR Online.
Funding for open access charge: Japan Society for the Promotion of Science (JSPS) through its Funding Program for World-Leading Innovative R&D in Science and Technology (FIRST Program).
Conflict of interest statement. None declared.
The authors would like to thank Dr Riu Yamashita (University of Tokyo) for help with running PSIBlast on the Super Computer System, and Dr Takeshi Obayashi (Tohoku University) for help with preparing the interaction network graphs. Computation time is provided by the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo.