Numerous recent efforts in systems biology have tried to characterize the set of all possible pairwise physical interactions, or the binary protein “interactome”, of an organism [1]. Most proteins perform their functions through interactions [4]. Thus, these large-scale maps are critical in elucidating the biological roles of the functional products of genes identified by large-scale genome and cDNA sequencing projects. Because most of these efforts are discovery-oriented and explore previously unknown functionalities, it is of utmost importance to ensure that the resultant maps are of high quality. Erroneous results at this stage could propagate into both ill-conceived hypotheses and futile downstream experiments. Moreover, it has been shown that high-quality interaction networks can provide key insights into fundamental topological and biological properties of cellular systems [5]. Although there are numerous databases [9] that try to systematically curate the entire repository of interactions for different organisms, there has been very little effort to filter out unreliable interactions. This has led to low overlaps between independent publications and resultant confusion as to which interactions are correct [17].
There are two major types of protein-protein interaction data: binary physical interactions and co-complex associations. While some databases distinguish between these two orthogonal datasets, others fail to do so. Binary interactions represent a direct biophysical interaction between two proteins. Co-complex associations, on the other hand, provide information about co-membership in a complex, and many of these associations may actually represent indirect interactions [17]. The biological information conveyed by these two kinds of interactions is different, and for many applications it is necessary to distinguish clearly between them.
There are two major methods to obtain a global map of binary interactions: literature curation (LC) and high-throughput experiments (HT) [18]. LC refers to systematically collecting interaction data from thousands of small-scale studies, each directed at validating a single or a few specific hypotheses. HT experiments, in contrast, produce large-scale interaction maps. Because most LC data are generated by hypothesis-driven experiments, it is much easier to infer biological function from those studies than from HT experiments. Although the search space of some HT experiments might be focused on certain functional groups, most HT experiments are not designed to detect the presence or absence of specific interactions. Any experiment can have two kinds of bias: “assay bias” and “sampling bias”. The first arises because no assay is perfect, and all experiments, whether HT or small-scale, have their own characteristic biases [20]. However, small-scale studies also have a sampling bias, i.e., they are typically focused on one or a few proteins of interest and hence selectively sample interactions from only a part of the search space. HT experiments are free of this sampling bias, i.e., the search space is scanned without a priori
]. Thus, for many global topological analyses, it is often necessary to use only the HT datasets.
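The two distinctions drawn above (binary versus co-complex, and LC versus HT) can be pictured as two labels attached to every interaction record. The following is only a minimal sketch under stated assumptions: the `Interaction` record, its `kind` and `source` fields, the `select` and `degrees` helpers, and the toy protein names are all hypothetical and do not reflect the actual schema or contents of any database. It shows how one might restrict a dataset to HT binary interactions before computing a simple topological quantity such as the degree of each protein.

```python
from collections import defaultdict
from typing import NamedTuple


class Interaction(NamedTuple):
    """Hypothetical record layout, assumed here for illustration only."""
    protein_a: str
    protein_b: str
    kind: str    # "binary" or "co-complex"
    source: str  # "LC" (literature-curated) or "HT" (high-throughput)


# Toy records with made-up protein names; these are not real data.
RECORDS = [
    Interaction("P1", "P2", "binary", "HT"),
    Interaction("P1", "P3", "binary", "LC"),
    Interaction("P2", "P3", "co-complex", "HT"),
    Interaction("P1", "P4", "binary", "HT"),
]


def select(records, kind, source):
    """Keep only the records with the requested interaction type and evidence source."""
    return [r for r in records if r.kind == kind and r.source == source]


def degrees(records):
    """Count the distinct interaction partners of each protein."""
    partners = defaultdict(set)
    for r in records:
        partners[r.protein_a].add(r.protein_b)
        partners[r.protein_b].add(r.protein_a)
    return {protein: len(found) for protein, found in partners.items()}


ht_binary = select(RECORDS, "binary", "HT")
ht_degrees = degrees(ht_binary)
print(ht_degrees)
```

Keeping the two labels separate, rather than merging them into a single category, lets the same records serve analyses that need all binary interactions as well as those that need only the HT subset.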
Here, we describe a publicly available protein-protein interaction database, HINT (High-quality INTeractomes), that directly addresses the above three issues (interaction quality, the distinction between binary and co-complex data, and the distinction between LC and HT data) and provides high-quality binary and co-complex interactions for human, S. cerevisiae, S. pombe, and O. sativa. The binary interactomes have also been divided into LC and HT subsets. Using these datasets, we show that there are significant sociological sampling biases in LC datasets, i.e., well-studied proteins tend to have more interactions in LC datasets for both human and S. cerevisiae. Finally, using only the high-quality HT interactions for human, we find that disease genes (i.e., genes that have a causal connection with one or more diseases) with more interactions tend to cause more diseases. Even though this result is unexpected in light of previous findings that interaction hubs are less likely to cause disease [21], it will help to elucidate the mechanisms of various disease processes and to develop corresponding treatments.
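Statements such as "proteins with more interactions tend to cause more diseases" are claims about a monotonic association between two per-protein quantities. One common way to quantify such an association is a rank correlation. The sketch below computes Spearman's rank correlation from scratch; it is an illustration only, not necessarily the statistic used in this work, and the function names and the toy numbers (`toy_degree`, `toy_disease_count`) are invented for the example.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Invented toy values: interaction degree and number of associated diseases.
toy_degree = [1, 2, 3, 5, 8]
toy_disease_count = [1, 1, 2, 3, 4]
rho = spearman(toy_degree, toy_disease_count)
print(round(rho, 3))
```

The same calculation could be applied to the LC sampling-bias question by replacing the disease counts with some measure of how well studied each protein is, for example a publication count, which is again an assumption made here for illustration.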