Literature curation of protein interaction data faces a number of challenges. Although curators increasingly adhere to standard data representations, the data that various databases actually record from the same published information may differ significantly. Some of the reasons underlying these differences are well known, but their global impact on the interactions collectively curated by major public databases has not been evaluated. Here we quantify the agreement between curated interactions from 15471 publications shared across nine major public databases. Results show that on average, two databases fully agree on 42% of the interactions and 62% of the proteins curated from the same publication. Furthermore, a sizable fraction of the measured differences can be attributed to divergent assignments of organism or splice isoforms, different organism focus and alternative representations of multi-protein complexes. Our findings highlight the impact of divergent curation policies across databases, and should be relevant to both curators and data consumers interested in analyzing protein-interaction data generated by the scientific community.
Database URL: http://wodaklab.org/iRefWeb
A myriad of cellular processes are carried out by groups of physically interacting proteins, or complexes, and the function of individual proteins often depends on their interaction partners. Substantial efforts are therefore being devoted worldwide to experimentally characterizing protein–protein interactions (PPIs) (1–9). This has in turn prompted the development of a number of specialized databases that curate and archive PPI data from the scientific literature and make them available to the scientific community (10).
Major PPI databases created in recent years such as HPRD (11), BioGRID (12) and IntAct (13), represent essentially independent annotation efforts driven by different research interests, and contain as a result complementary as well as redundant information. But exactly how much information is shared by the different databases and how much is unique, is generally not well documented, because comparing and integrating PPI information across the databases remains a challenging undertaking. The different databases apply different rules for capturing the data and often use different systems for cross-referencing genes and proteins across biological databases. Curation of the same publication by two different databases may hence result in significant discrepancies between the data that they record.
Adoption of the Proteomics Standards Initiative—Molecular Interaction (PSI-MI) controlled vocabulary and data structure (14) has been a major step forward in creating a common framework for representing PPI data. But although all major PPI databases adhere in principle to the PSI-MI standard, the actual implementations are still far from uniform. The outstanding differences prompted the creation of the IMEx consortium, committed to further unifying the PPI data representations and curation policies (15).
These standardization efforts have significantly eased the bottleneck for creating ‘meta’ resources that aggregate information from multiple PPI databases, with several such resources developed recently (16–20). But the aggregated PPI data made available by these resources are only partially normalized at best, due to many outstanding issues.
A number of problems continue to plague the curation and integration of PPI data. One problem, which further complicates the tedious task of assigning and cross-referencing gene and protein identifiers, is the annotation of protein isoforms. In some cases an interaction is specific to a particular protein isoform, whereas in others it is not. So far this information is rarely provided in the original publication. As a result the same protein may be annotated by two different databases as interacting with two different protein isoforms, each represented by a distinct identifier. Addressing this issue would require mapping the different isoforms of a protein to the corresponding gene (or ‘canonical’ isoform) (21), which is generally, but not always, uniquely defined. But this is currently not the accepted practice.
Guidelines on recording the organism in which an interaction has been observed also tend to differ. Some databases make the deliberate choice to curate only interactions pertaining to a specific organism from a given publication, or to infer interactions in a given organism (mainly human) on the basis of reported interactions in one or more related organisms (e.g. other mammals such as mouse or rat) (11,22). Problems with interpreting the published text are certainly also a factor, especially in studies of interactions in human, mouse or rat models. Indeed, cell lines from various model organisms are often used to draw conclusions about human cells. It is also not uncommon for authors to refer to previous studies for the description of cell lines and organisms, leaving it to the curator to trace earlier publications and resolve ambiguities.
The representation of multi-protein complexes identified by various detection methods (3,23,24) is yet another area where curations diverge. Complexes can be recorded either as a group of three or more associated proteins, or as a series of binary associations, depending on the practices adopted by the database (Supplementary Discussion S1). A common representation is the so-called spoke model, in which one protein is designated as a hub (or ‘bait’) and the complex is represented by a set of binary associations, each linking the bait to one of the other proteins (‘prey’) (25). Such associations may be distinguished from experimentally detected binary interactions (2,26) by examining the PSI-MI ‘interaction type’ record, since binary interactions derived from complexes are usually annotated as ‘association’ or ‘physical association’ (rather than ‘direct interaction’).
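The spoke model described above can be illustrated with a minimal Python sketch. The function name `spoke_expand` and the protein labels are hypothetical and are not taken from any of the databases discussed.

```python
def spoke_expand(bait, preys):
    """Expand a complex into bait-prey binary associations (spoke model)."""
    # Each pair is stored as a frozenset so that (A, B) and (B, A) compare equal.
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


# Hypothetical complex with bait "A" and three preys.
pairs = spoke_expand("A", ["B", "C", "D"])
print(sorted(tuple(sorted(p)) for p in pairs))
```

In this sketch a complex of n proteins yields n-1 binary associations, and choosing a different bait yields a different set of pairs, which is one way in which two binary expansions of the same complex can diverge.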
Databases also tend to differ on how they curate data sets produced by large-scale (high throughput) studies either on binary interactions or complexes. These studies often make available a high-confidence subset of the data, in addition to providing access to the full processed data set, or to the raw unprocessed data (2,27), but there is no general agreement between databases on which of these data sets is best fit for redistribution.
While all these problems are well known to database curators (28–30), the extent to which the ensuing differences impact the data currently stored across major PPI databases has so far not been quantified. The increasing number of non-experts who rely on PPI data for their research therefore often tend to ignore these problems altogether.
In this article we perform a quantitative evaluation of the level of agreement of the PPI data curated by major public databases. Our analysis is carried out on the global landscape of PPI data consolidated from nine major databases that focus primarily on the curation of experimentally derived physical PPIs: BIND (31), BioGRID (12), CORUM (22), DIP (32), IntAct (13), HPRD (11), MINT (33), MPact (34) and MPPI (35). The consolidation was performed using the Interaction Reference Index process (18) (‘Methods’ section), and data analysis was enabled by iRefWeb (http://wodaklab.org/irefweb), a web resource that serves as portal to the consolidated information (36).
The global PPI landscape with all its supporting evidence was generated by iRefIndex version 6.0. It comprised 271716 distinct physical interactions involving 70449 proteins. These interactions are associated with 1324 different organism-taxonomy identifiers and supported by a total of 42651 publications. Interactions inferred by computational methods (37,38) and genetic interactions, which represent phenotype alterations produced by the mutation/deletion of one gene in the background of a mutation/deletion of another gene (39–41), were not considered here mainly because only a small subset of the databases curate them.
To perform the evaluation, we compared the annotations derived from the same publication by different databases. Whenever two databases cite the same publication as supporting an archived interaction, we used a similarity score to quantify the agreement between, respectively, the interactions and the proteins described in the original curated records, as outlined in Figure 1. These are two basic descriptors that (in principle) uniquely define the biological entity that was annotated. Ideally, they must be specified unambiguously by the curator and can be readily analyzed programmatically.
Analysis of the 15471 shared publications reveals that on average, two databases fully agree on only 42% of the interactions and 62% of the proteins curated from the same publication, but the level of agreement for individual publications varies considerably. We then quantify how this initial level of agreement is globally impacted by factors such as divergent annotation of organisms, different splice isoform assignments and alternative representations of multi-protein complexes. Our findings highlight the role played by differences in curation policies (past or present) across databases. They also underscore the challenges that annotators face in interpreting published information and provide valuable insight into the hurdles that bioinformaticians need to overcome to integrate PPI data from multiple sources. Our study should help in formulating recommendations and developing improved software tools for all those interested in recording, integrating and analyzing protein interaction data.
In order to measure the agreement of the information curated across databases on a per-publication basis, we examine the subset of shared publications, i.e. those curated by two or more databases. We define a ‘co-citation’ as an instance of two databases citing the same publication in an interaction record (‘Methods’ section). Depending on the number of curating databases, a single publication may give rise to several pairwise co-citations (Figure 1). Only ~36% of all cited publications, numbering 15471, are shared, and those give rise to 27399 pairwise co-citations.
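The enumeration of pairwise co-citations can be sketched as follows. This is a minimal illustration, not the procedure used to build the data set: the function name `pairwise_cocitations` and the publication labels are hypothetical, and only the database names come from the text.

```python
from itertools import combinations


def pairwise_cocitations(citations):
    """citations: dict mapping a publication ID to the set of databases citing it.

    Returns a list of (publication, db1, db2) tuples, one per pair of databases.
    """
    result = []
    for pub, dbs in citations.items():
        for db1, db2 in combinations(sorted(dbs), 2):
            result.append((pub, db1, db2))
    return result


# Hypothetical publication labels; database names are those in the text.
citations = {
    "pub1": {"BioGRID", "HPRD", "IntAct"},  # shared by three databases
    "pub2": {"DIP", "MINT"},                # shared by two databases
    "pub3": {"BIND"},                       # not shared, so no co-citation
}
cocites = pairwise_cocitations(citations)
print(len(cocites))  # 3 pairs from pub1 plus 1 pair from pub2
```

A publication curated by k databases contributes k(k-1)/2 co-citations, which is why the number of co-citations exceeds the number of shared publications.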
For each pairwise co-citation we compute two similarity scores, SPPI and SProt, which take values between 0 and 1 and measure the agreement, respectively, between the annotated PPIs, and between the annotated proteins engaged in these interactions (‘Methods’ section). Figure 2a plots the distributions of the similarity scores for the interactions (horizontal axis) and for the set of annotated proteins (vertical axis). It shows that the level of agreement between the annotated information in the analyzed co-citations ranges between full agreement and complete disagreement. On average, two databases curating the same publication agree on 42% of their interactions. The discrepancies between the sets of proteins annotated from the same publication are typically less pronounced, with an average agreement of 62%, but the overall trend is similar.
Admittedly, our criterion for agreement is quite strict, as it requires that the two databases refer to the exact same amino acid sequence and same organism taxonomy identifier when annotating each interacting protein. However, this allows us to quantify how far we are from the ideal case of perfect agreement at the starting point of our investigation. Seeing that the levels of agreement on both the annotated interactions and proteins are low, we now examine some of the factors that contribute to the observed differences.
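To make the strict criterion concrete, the sketch below computes two scores in the range 0 to 1 for a single co-citation. The exact definition of SPPI and SProt is given in the ‘Methods’ section and is not reproduced here; the Jaccard index used below is an assumption chosen for illustration, as are the function names and the toy sequences. What the sketch does take from the text is the identity rule: two proteins match only if both the amino acid sequence and the taxonomy identifier are the same.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_scores(record_a, record_b):
    """Return (s_ppi, s_prot) for two curated records of one publication.

    Each record is a set of interactions; an interaction is a frozenset of
    proteins; a protein is a (sequence, taxonomy_id) tuple (strict identity).
    """
    s_ppi = jaccard(record_a, record_b)
    proteins_a = set().union(*record_a)
    proteins_b = set().union(*record_b)
    s_prot = jaccard(proteins_a, proteins_b)
    return s_ppi, s_prot


# Toy proteins: p2 and p3 share a sequence but differ in taxonomy ID
# (9606 = human, 10090 = mouse), so they count as different proteins.
p1 = ("MKT", 9606)
p2 = ("MLS", 9606)
p3 = ("MLS", 10090)
db_a = {frozenset((p1, p2))}
db_b = {frozenset((p1, p3))}
s_ppi, s_prot = similarity_scores(db_a, db_b)
print(s_ppi, round(s_prot, 2))
```

In this toy case a single divergent organism assignment is enough to bring the interaction score to zero while the protein score stays above zero, mirroring the observation that protein-level agreement is typically higher than interaction-level agreement.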
Since the experimental characterization of PPIs tends to vary between organisms in terms of both the methodology and coverage, we investigate how the level of agreement varies across publications dealing with specific organisms. Organism information is usually unambiguously recorded using the NCBI taxonomy identifiers (42) and can therefore be readily analyzed and compared. We consider a co-citation as pertaining to a given organism when at least one of the two databases citing the corresponding publication recorded proteins from that organism. All 27399 co-citations are classified in this fashion into 1324 categories corresponding to specific organisms. In this classification a given co-citation can belong to more than one organism category, because the same publication may be interpreted differently by two databases, with for example, one database recording a human interaction but the other recording an interaction from mouse.
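The classification rule can be sketched in a few lines of Python. The function name `classify_by_organism` and the co-citation labels are hypothetical; the NCBI taxonomy identifiers used in the example are 9606 (human), 10090 (mouse) and 4932 (S. cerevisiae).

```python
from collections import defaultdict


def classify_by_organism(cocitations):
    """Group co-citations into organism categories.

    cocitations: iterable of (label, taxids_db1, taxids_db2), where each
    taxids set holds the organisms recorded by one database. A co-citation
    pertains to an organism if at least one database recorded it.
    """
    categories = defaultdict(list)
    for label, taxids_1, taxids_2 in cocitations:
        for taxid in set(taxids_1) | set(taxids_2):
            categories[taxid].append(label)
    return dict(categories)


cocitations = [
    ("c1", {9606}, {10090}),        # one database says human, the other mouse
    ("c2", {4932}, {4932}),         # both say S. cerevisiae
    ("c3", {9606}, {9606, 10090}),  # second database records two organisms
]
categories = classify_by_organism(cocitations)
print({taxid: labels for taxid, labels in sorted(categories.items())})
```

Because the rule takes the union of the two organism sets, co-citations c1 and c3 each fall into both the human and the mouse category.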
The organism categories vary widely in the number of co-citations that they contain and in the agreement levels of the co-citations therein. The comparison across categories is summarized in Figure 3a, which plots the SPPI and SProt scores for categories with at least 50 co-citations. Among the largest categories, co-citations corresponding to yeast Saccharomyces cerevisiae stand out as displaying the highest agreement level (SPPI and SProt averaging 63 and 80%, respectively), whereas co-citations dealing with mouse or rat display very poor agreement (with average SPPI and SProt being, respectively, 12 and 26% for mouse and 11 and 26% for rat). The average agreement for human co-citations is roughly in the middle of the range (37, 58%). For other well-studied organisms such as fission yeast Schizosaccharomyces pombe, plant Arabidopsis thaliana, worm Caenorhabditis elegans and fly Drosophila melanogaster, the average agreement is relatively high, whereas in some of the vertebrate species and in the bacterium Escherichia coli it is significantly lower (Figure 3a).
The relatively low agreement level within the vertebrate categories led us to examine the extent to which the detected discrepancies were affected by differences in splice-isoform assignments. We therefore compared the level of agreement across the PPI landscape before and after the splice isoform normalization process (‘Methods’ section). This process reduced the original set of 271716 interactions involving 70449 proteins to 248465 interactions involving 63871 proteins. At the same time it increased the level of agreement from 42 to 54±41% for PPIs, and from 62 to 71±33% for proteins. Considering specific organisms, the agreement improved for many species, especially for human and fly (Figure 3b). We thus confirm that the use of different splice isoforms in the description of gene products is indeed a significant contributor to annotation discrepancies.
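The effect of isoform normalization on a single co-citation can be sketched as follows. The actual mapping procedure is described in the ‘Methods’ section; here the mapping is simply assumed to be available as a lookup table, and the identifiers (`X`, `X-2`, `Y`) and the function name `normalize` are hypothetical.

```python
def normalize(interactions, canonical):
    """Replace each protein by its canonical isoform (unchanged if unmapped)."""
    return {frozenset(canonical.get(p, p) for p in ppi) for ppi in interactions}


# Hypothetical lookup table: isoform "X-2" maps to canonical protein "X".
canonical = {"X-2": "X"}

db_a = {frozenset(("X", "Y"))}    # one database records the canonical form
db_b = {frozenset(("X-2", "Y"))}  # the other records a splice isoform

print(db_a == db_b)                                          # before normalization
print(normalize(db_a, canonical) == normalize(db_b, canonical))  # after
```

Because distinct isoform-level records collapse onto the same canonical interaction, normalization also reduces the total number of distinct interactions and proteins, consistent with the drop in counts reported above.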
Following these findings, information provided by the iRefWeb interface refers to proteins and interactions mapped to their canonical isoforms (36). However, information on the particular splice isoforms curated by the source databases is preserved and can be queried, as each consolidated interaction links back to the original records from which it was derived.
Even after the consolidation of isoforms, the level of agreement associated with major organism categories, such as Mouse and Rat, remained <20% for the interactions and <30% for proteins (Figure 3b). To probe further into the impact of organism assignments, we investigated the following question: When two databases cite the same article, and one of them records all proteins as belonging to organism A, then which organism(s) does the other database record, and how often?
Results show that disagreements on the annotated organism, and the annotation of PPIs from different organisms by different databases, are quite common (Figure 3c). The discrepancies for the Mouse or Rat categories are dramatic. Of the 1602 co-citations in which one database records exclusively mouse proteins, the other database does so in only 242 cases (15%). Most commonly, however (in 1182 cases, or ~74%), human proteins are recorded instead. The trend is very similar for the smaller Rat category. For the category Human, there are 19923 pairwise co-citations in which one of the two databases records exclusively human proteins. In 4729 of these cases (or 24%) the other database reports proteins from a different organism (most often mouse or rat) or a combination of organisms.
In contrast, organism assignments are much more consistent for S. cerevisiae, A. thaliana, S. pombe, worm and fly (Figure 3c), with only rare instances where interactions of other organisms (mainly human) are recorded instead. Additionally, discrepancies for the fission yeast, S. pombe, often involve the attribution of proteins to the yeast S. cerevisiae by the other database. Discrepancies are also observed for the E. coli category, but are mostly a result of different annotations of E. coli strains (Figure 3c).
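The question posed above (when one database records only organism A, what does the other database record?) amounts to a simple tabulation, sketched below. The function name `other_db_organisms` and the toy co-citations are hypothetical; the counts reported in the text come from the full data set, not from this example.

```python
from collections import Counter


def other_db_organisms(cocitations, organism):
    """For co-citations in which one database records only `organism`,
    count the organism sets recorded by the other database."""
    counts = Counter()
    for taxids_1, taxids_2 in cocitations:
        if taxids_1 == {organism}:
            counts[frozenset(taxids_2)] += 1
        elif taxids_2 == {organism}:
            counts[frozenset(taxids_1)] += 1
    return counts


HUMAN, MOUSE = 9606, 10090  # NCBI taxonomy identifiers
cocitations = [
    ({MOUSE}, {MOUSE}),         # both databases record mouse
    ({MOUSE}, {HUMAN}),         # the other database records human instead
    ({HUMAN}, {MOUSE}),         # same situation, databases swapped
    ({HUMAN}, {HUMAN}),         # not a mouse co-citation
    ({MOUSE}, {HUMAN, MOUSE}),  # the other database records a combination
]
counts = other_db_organisms(cocitations, MOUSE)
for organisms, n in counts.items():
    print(sorted(organisms), n)
```

A co-citation in which both databases record only the organism of interest is counted once, under that organism, so the counts sum to the number of qualifying co-citations.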
In order to factor out the effects of divergent organism assignments, we further restrict our analysis to co-citations in which both databases record proteins from the same organism. In such co-citations improvements in annotation consistency of the order of 20% are observed for several organism categories (Table 1 and Figure 2c and d), confirming that divergent organism assignments contribute significantly to the observed differences.
Experiments on mammalian cells and proteins are of crucial importance to the studies of human diseases, especially when such experiments involve known disease-related proteins. However, disagreements between databases are common even in publications involving such proteins.
Figure 4a illustrates the difficulty of curating a publication describing interactions between a well-known breast cancer protein BRCA1 and another protein, BAP1, which binds to BRCA1 and enhances cell-growth suppression (43). The published text describes the interactions between the BRCA1 RING finger domain (which has the same sequence in human and mouse) with the human BAP1 protein as well as with different variants of the BAP1 mouse ortholog. The three databases that cite this study differ in their representation of BAP1 as either a human or a mouse interactor of BRCA1, with only IntAct faithfully representing both versions.
Another publication describes the interactions of only three proteins, including a well-known tumor suppressor TP53 as well as a BRCA1-associated protein BARD1 (44). The experiments were conducted using human prostate-cancer cells, rat ovarian-cancer cells and proteins from human, rat and mouse. As a result, the difficulty of correctly interpreting the paper rises dramatically and the annotations have more potential to differ. Two of the databases record only human interactions, with BIND recording them both as a single complex comprising three proteins and as three distinct pairwise associations (Figure 4b). In contrast, curating the same paper, IntAct records interactions involving human, mouse and rat proteins.
The presence of multiple organisms in a PPI annotation is not in itself a sign of annotation discrepancies or difficulties, since some publications may report interactions between host and pathogen proteins. Figure 4c illustrates the annotations derived from such a study, which investigated the interference of the human papillomavirus (HPV) with the human insulin-signaling pathway (45). However, only one of the two databases annotates the interactions between human and HPV proteins, whereas the other database does not record even a single HPV protein interactor, for reasons that are not clear.
Several additional examples of annotation differences can be found in the Supplementary Discussion S3.
Even when both databases completely agree on the organism assignment, and after splice isoform normalization, agreement levels for interactions in the largest organism categories, except worm, do not exceed 66% (Table 1). To elucidate the factors that contribute to the outstanding differences, we analyze the two largest categories of the human-only and yeast (S. cerevisiae)-only co-citations, which together represent the bulk (74%) of all 27399 co-citations in our data.
First, we examine co-citations in which the two databases disagree on every PPI described in the publication (SPPI=0). Such co-citations comprise 17% of the human-only and 13% of the yeast-only categories (Figure 2c and d). In most of these cases the two databases have a partial overlap between the annotated sets of proteins. This suggests that a major source of the remaining disagreements on PPIs, after splice isoform normalization has been performed, is the divergent identification of individual proteins by the databases. A protein may be specified differently due either to the existence of multiple representations that cannot be easily mapped to the same gene by our consolidation procedure, or to a genuine curation discrepancy. In such cases, the divergence propagates to the corresponding PPI records in the two databases, causing them to differ even if they agree on a fraction of the interacting proteins. We anticipate that this subset of the data is enriched for papers that should be re-examined by curators of the source databases for potential issues with curation errors or other genuine differences.
In a small fraction of the co-citations disagreements on the interactions persist despite complete agreement on the proteins involved. Indeed, 1% of all human-only and 2% of yeast-only co-citations agree perfectly on the annotated proteins (SProt=1), but disagree completely on the reported interactions (SPPI=0). The main origin of these disagreements is the group versus binary representations of multi-protein complexes, as already mentioned.
For example, Figure 4d details the curated information from an experimental study (46) that identified a protein complex of the breast cancer protein BRCA1, the BRCA1-associated protein BARD1, and a cleavage-stimulation factor CSTF. The four databases that annotated the paper record the complex differently, using either multi-subunit groups, or binary expansions, or both. However, they largely agree (with BioGRID and HPRD agreeing completely) on the sets of proteins involved in these interactions.
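The way in which a group representation and a binary representation can agree on every protein yet share no interaction is shown in the simplified sketch below. The protein labels are borrowed from the example above, but the two records are deliberately reduced toy versions and do not reproduce the actual database entries shown in Figure 4d.

```python
# One database records the complex as a single group of three proteins.
group_record = {frozenset(("BRCA1", "BARD1", "CSTF"))}

# Another database records a binary (spoke) expansion with BRCA1 as the bait.
binary_record = {
    frozenset(("BRCA1", "BARD1")),
    frozenset(("BRCA1", "CSTF")),
}

# No interaction is shared, because a three-member group never equals a pair.
shared_ppis = group_record & binary_record

# The protein sets, however, are identical.
proteins_group = set().union(*group_record)
proteins_binary = set().union(*binary_record)

print(len(shared_ppis), proteins_group == proteins_binary)
```

Under a strict matching criterion such a co-citation therefore shows complete disagreement on interactions together with full agreement on proteins.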
Overall, co-citations involving groups of proteins display significantly lower agreement on PPIs (29% for human-only and 34% for yeast-only co-citations, on average) than those that deal with binary representations (72 and 70% on average, respectively). However, the agreement on the proteins involved in multi-protein groups remains rather high (76 and 86% on average, respectively; Table 2).
Representing complexes as groups of proteins is currently not a widely adopted practice, and occurs in just 12% of the human-only and 9% of yeast-only co-citations. However, it accounts for, respectively, 32 and 22% of co-citations showing complete disagreement on PPIs (SPPI=0). Furthermore, it accounts for as many as 75% of the human-only and 45% of yeast-only co-citations showing complete disagreement on PPIs but full agreement on the proteins involved (SPPI=0, SProt=1). These numbers do not include co-citations where both databases translated the protein groups (complexes) into sets of binary interactions, which we could not systematically trace, and which we suspect may contribute to some of the remaining cases of low SPPI, mainly due to different conventions used to perform the translation.
Different pairs of databases display different levels of agreement in each organism category. The average pairwise agreement on PPIs between the source databases, and the corresponding number of co-citations for the human-only and yeast-only co-citations, respectively, are summarized in Figure 5, and the Supplementary Tables S3 and S4.
These summaries show that as many as 8340 out of 15194 human-only co-citations (or 55%) are those by BioGRID and HPRD (Figure 5a and b and Supplementary Table S3). The next largest overlap is the 1653 co-citations (11%) by BIND and HPRD, with all other overlaps not exceeding 5%. Naturally, the average agreement level between BioGRID and HPRD (SPPI=0.71, SProt=0.84) prominently affects the distribution of the similarity values for human data. Most other pairs of databases have somewhat lower, but comparable agreement levels, with the exception of CORUM, which annotates mammalian protein complexes. CORUM participates in 1018 human-only co-citations, displaying a low average agreement on PPIs (SPPI=0.27±0.38) but a high agreement on proteins (SProt=0.78±0.23) with the other databases. This low level of agreement stems from the fact that the CORUM database represents complexes as groups of associated proteins. Such group representations will invariably display disagreements with binary expansions derived from the same published information, due to differences in protein composition. Also, two group representations independently curated from the same article are, in general, less likely to have identical protein compositions, further contributing to the observed differences.
The overlap between databases for yeast-only shared publications is distributed more evenly than for the human-only articles (Figure 5c and d and Supplementary Table S4). Of the 4983 yeast-only co-citations, 21% are those by BioGRID and DIP, 17% by BioGRID and BIND, 12% by BIND and DIP, 11% by BIND and MPact, 10% by DIP and MPact, etc. As in the case of human data, the average agreement rates across different database pairs are similar, especially for pairs with a significant overlap in cited publications.
Clearly the pairwise agreement levels for these organism-specific co-citations are significantly higher than those obtained for co-citations prior to factoring out divergent protein representation and/or organism assignments. Indeed, for the latter type of co-citations many pairs of databases agree on less than half of PPIs on average (Supplementary Table S1 and Supplementary Figure S1).
Interestingly, the pairwise agreement between members of the IMEx databases (DIP, IntAct, MINT) is in general better than average, albeit similar to those of some other database pairs. This is observed both before and after elimination of some of the major discrepancy-causing factors (Figure 5, Supplementary Figure S1 and Tables S1–S4), but is unlikely to fully reflect the common curation policies adopted by this consortium, since much of the data currently stored in the IMEx databases predates the implementation of these policies.
In this study we quantified the level of agreement in the PPI data curated from 15471 publications co-cited across nine major public databases. In doing so we evaluated the global impact of several factors on the consistency of the curated information.
One key factor is the divergence in organism assignments, which was detected in 21% of all co-citations (5769 out of a total of 27399 co-citations). This divergence may sometimes result from the difficulty of interpreting the complex information reported in the original publication. Most commonly, however, it is due to the application of different curation rules, with some databases recording only interactions in the organism of interest, or systematically transferring interactions identified in one organism to its orthologs in another. Differences in organism assignments across databases should be dramatically reduced by adherence to common curation policies that would, for example, stipulate flagging interactions inferred by homology or impose stricter rules for selecting publications to curate.
Another factor, which contributes significantly to the detected differences, is the treatment of multi-protein complexes. Of the 3470 co-citations that involve complexes, only 76 are in complete agreement following the normalization of splice isoforms, indicating that up to 3394 (or 12% of the full data set) might be affected by this factor. As already mentioned, this poor agreement level is mostly due to different representations of the data that cannot be readily inter-converted (Supplementary Discussions S1 and S3). Adopting a common convention according to which multi-protein complexes identified by various purification methods are represented as groups of associated proteins (15) is a simple solution that should significantly improve agreement levels. However, once such a convention is widely adopted, the criteria for quantifying the agreement between two databases curating the same reported complex should be relaxed from requiring a perfect match between the annotated proteins, as done here, to quantifying the level of overlap between the two protein lists.
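The difference between the strict criterion and a relaxed, overlap-based one can be sketched as follows. The text does not prescribe a particular overlap measure; the ratio of shared to total proteins used here is an assumption made for illustration, and the function names and protein labels are hypothetical.

```python
def strict_match(group_a, group_b):
    """Strict criterion: the two protein lists must be identical as sets."""
    return set(group_a) == set(group_b)


def group_overlap(group_a, group_b):
    """Relaxed criterion (assumed measure): shared proteins / all proteins."""
    a, b = set(group_a), set(group_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical annotations of the same reported complex; the second
# database omits one subunit.
complex_db1 = ["A", "B", "C", "D"]
complex_db2 = ["A", "B", "C"]

print(strict_match(complex_db1, complex_db2))   # no credit under strict matching
print(group_overlap(complex_db1, complex_db2))  # partial credit under overlap
```

With a relaxed measure of this kind, two curations that differ by a single subunit would be scored as largely concordant rather than as a complete mismatch.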
By far the most crucial factor affecting the agreement levels analyzed here is the proper assignment of protein and gene identifiers across biological databases. Our analysis relies completely on the iRefIndex consolidation procedure for establishing the identity of the proteins in the PPI records. This procedure maps the variety of protein identifiers recorded by the databases to the protein amino acid sequences, using a series of steps (18). The last step, introduced recently, maps proteins to their canonical splice-isoforms (or corresponding genes).
Here we were able to evaluate the contribution to the observed disagreements both before and after this last mapping step. Our isoform normalization method was able to eliminate all disagreements on proteins in 2675 co-citations, or nearly 10% of the co-citation data set. It increased the fraction of co-citations with a perfect agreement on proteins from 29 to 39%, and those with perfect agreement on interactions from 24 to 32%.
The problem of cross-referencing proteins and genes across biological databases is an endemic one, over which the PPI databases have very limited control. Addressing it in the context of PPI curation needs to involve the cooperation of database curators, bioinformaticians as well as the authors of experimental studies (47,48). With this goal in mind, concrete proposals on how to help authors of publications provide standardized descriptions of interactions have recently been made (MIMIx: minimum information required for reporting a molecular interaction experiment) (49). Mechanisms for submitting annotations directly to the PPI databases in a unified format have also been developed (15).
Overall, our analysis lends strong support to the contention that curation policies play a key role in shaping the data collectively curated by PPI databases. These policies determine how useful the data are to the scientific community, in particular to the life scientists who routinely rely on these data for biomedical and clinical applications. Indeed, divergent organism assignments, the use of alternative protein identifiers, or different representation of complexes, although not reflecting actual curation errors, may lead to misinformation. These issues were raised in a recent study (28,30), which suggested that ‘errors’ of the type ‘wrong protein’ and ‘wrong organism’, among others, are not uncommon, and that the annotation of complexes as sets of spoke-expanded binary interactions is a potential source of concern. Our analysis has quantified these discrepancies on a global level, uncovering many more cases where pairwise discrepancies are attributable to similar issues (Examples 4–6 in the Supplementary Discussion S3).
Standardizing the curation policies along the lines advocated by the IMEx consortium, including the requirement for in-depth curation of articles (50), should go a long way towards resolving these issues. Members of the IMEx consortium also agreed to curate complementary sets of publications in order to increase coverage. We would like to suggest that this policy be revised to include a large enough number of commonly curated publications, in order to generate co-citations by IMEx members that can then be analyzed for compliance with the IMEx guidelines, using similar methods as those employed here.
The issues related to data curation are in no way limited to protein interaction data. Indeed, the importance of biological databases to the research community, combined with the rapid growth of the collected data, has highlighted a number of current limitations and needs related to the continued maintenance of such resources (51). Facing such challenges, some of the current efforts emphasize broader involvement of the research community in curation efforts (52), while others attempt to supplement or even replace manual curation with automated literature mining (53,54). But so far, the limitations of automated approaches only further underscore the many ambiguities and challenges of biocuration, indicating that manual curation is here to stay in the foreseeable future and that standardization of manual curation is an essential requirement.
Lastly, we examined the agreement level in co-citations of high-throughput articles and found it to be poor. This seems to be mainly due to the increased likelihood of differences occurring as the number of possible interacting entities grows, as well as to divergent policies for the annotation of large sets of raw versus filtered PPI data by each database (36). However, the small number of such articles contributes only marginally to the discrepancies found in the co-citation data set taken as a whole (Supplementary Discussion S2). Additional filtering of the data on the basis of various evidence codes, such as ‘interaction type’ or ‘interaction detection method’, was not performed, mainly because the annotated information is frequently missing or too inconsistent to objectively evaluate agreement levels without a systematic re-examination of the original publications.
Each of the co-citations described in this article may be further explored using the ‘PubMed Report’ and ‘PubMed Detail’ utilities of the iRefWeb interface (http://wodaklab.org/iRefWeb/pubReport/), as described in detail in ref. (36). Further work with databases can now target those disagreements that are more likely due to genuine curation policy differences or curation errors. This in turn can lead to improved data curation policy and data that are more easily integrated, accessible and reliable. It is our hope that this study and its associated resources will contribute towards this goal.
Interaction data were consolidated from the following public databases and release dates, indicated in parentheses: BIND (25 May 2005), BioGRID (7 September 2009), CORUM (8 September 2008), DIP (6 January 2009), HPRD (6 July 2009), IntAct (19 July 2009), MINT (28 July 2009), MPact (10 January 2008) and MPPI (6 January 2004). The consolidation was performed using the Interaction Reference Index process (18). iRefIndex examines amino acid sequences to establish the identity of proteins, instead of relying on gene or protein identifiers or database accession numbers, which are often subject to change. This enables it to reliably merge records from different databases that use distinct types of protein identifiers to support the same PPI.
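As a rough illustration of sequence-based consolidation, the sketch below keys each protein by a digest of its amino acid sequence together with an organism identifier, so that records using different accessions for the same sequence collapse onto one key. The record layout, the accessions, the sequences, the taxonomy identifiers and the choice of SHA-1 are all assumptions made for illustration; none of them is taken from the iRefIndex implementation described in ref. (18).

```python
import hashlib

def protein_key(sequence, taxon_id):
    """Hypothetical identity key: digest of the normalized sequence plus organism."""
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).hexdigest()
    return f"{digest}:{taxon_id}"

def interaction_key(participants):
    """Order-independent key for an interaction, built from its protein keys."""
    return frozenset(protein_key(seq, taxon) for seq, taxon in participants)

# Two made-up records describing the same protein pair under different accessions.
record_db1 = {"ids": ("ACC_A1", "ACC_B1"),
              "participants": [("MKTAYIAK", 9606), ("MSLLNQ", 9606)]}
record_db2 = {"ids": ("alt-b", "alt-a"),
              "participants": [("mslLNQ", 9606), ("MKTAYIAK", 9606)]}

# The accessions differ, but the sequence-derived keys coincide.
same_interaction = (interaction_key(record_db1["participants"])
                    == interaction_key(record_db2["participants"]))
```

Because the key depends only on sequence and organism, a change of accession in a source database does not affect the merge, whereas assigning the same sequence to a different organism yields a different key.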
The aggregated information comprised all the supporting evidence captured by the source databases using the PSI-MI controlled vocabulary (14). This includes the terms specifying the ‘interaction type’, the ‘interaction detection method’, and the corresponding literature citation, which is hyperlinked to the original PubMed identifiers. Discrepancies in the recorded information on the interaction type and detection method, while also revealing and important, were not globally monitored at this stage. The inherent ambiguities associated with this information make it very difficult to objectively quantify any detected discrepancies, let alone to interpret them.
Genetic interactions (39–41) were identified and marked for exclusion if their interaction types were defined as such by the source databases using the appropriate PSI-MI terms, as detailed in ref. (36). Inferred interactions from the OPHID database were likewise excluded (38). The iRefWeb resource (http://wodaklab.org/irefweb) provides details on the consolidation process, and views of the full original records as annotated by the source databases (36).
For all instances where two databases cite the same publication in their interaction record, we evaluate the agreement between the interactions and the proteins that they annotated from the publication. We denote such instances as ‘co-citations’. Depending on the number of databases citing the same publication, a single publication may give rise to several pairwise co-citations (Figure 1).
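The enumeration of pairwise co-citations can be sketched as follows. The function name, the input layout (a mapping from database name to the set of PubMed identifiers it cites) and the identifiers themselves are hypothetical and serve only to show how one publication cited by k databases yields k(k-1)/2 pairs.

```python
from itertools import combinations

def pairwise_cocitations(citations):
    """Map each publication to the database pairs that both cite it.

    `citations` maps a database name to the set of PubMed identifiers it cites.
    """
    databases_by_pmid = {}
    for database, pmids in citations.items():
        for pmid in pmids:
            databases_by_pmid.setdefault(pmid, set()).add(database)
    # Keep only publications cited by at least two databases.
    return {
        pmid: list(combinations(sorted(dbs), 2))
        for pmid, dbs in databases_by_pmid.items()
        if len(dbs) >= 2
    }

# Made-up example: publication 1001 is cited by three databases, 1002 by two.
citations = {
    "BioGRID": {1001, 1002},
    "IntAct": {1001, 1003},
    "MINT": {1001, 1002},
}
cocitations = pairwise_cocitations(citations)
```

In this example publication 1001 gives rise to three pairwise co-citations, publication 1002 to one, and publication 1003, cited by a single database, to none.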
For each pairwise co-citation we compute two Sorensen–Dice similarity scores, SPPI and SProt. These two quantities measure the overlap, respectively, between the annotated PPIs, and between the proteins engaged in these PPIs. For sets A and B, the Sorensen–Dice similarity score is defined as the ratio of the overlap between the two sets to their average size (55):
$$S(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$
SPPI and SProt take values between 0 and 1. For example, if a publication gives rise to a co-citation with SPPI=0.8, this indicates that each of the two co-citing databases shares with the other database, on average, 80% of its interaction records that cite this publication.
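A minimal sketch of the two scores is given below, using the definition stated above (overlap divided by average set size). Representing each interaction as a frozenset of protein identifiers is an assumption made for this example, and the identifiers are made up.

```python
def sorensen_dice(a, b):
    """Overlap of two sets divided by their average size: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("score is undefined when both sets are empty")
    return 2 * len(a & b) / (len(a) + len(b))

def cocitation_scores(ppis_1, ppis_2):
    """Return (S_PPI, S_Prot) for the interactions two databases curated from one publication.

    Each interaction is a frozenset of protein identifiers.
    """
    proteins_1 = set().union(*ppis_1)
    proteins_2 = set().union(*ppis_2)
    return sorensen_dice(ppis_1, ppis_2), sorensen_dice(proteins_1, proteins_2)

# Made-up curation of one publication by two databases.
db1 = {frozenset({"P1", "P2"}), frozenset({"P1", "P3"}), frozenset({"P2", "P4"})}
db2 = {frozenset({"P1", "P2"}), frozenset({"P1", "P3"})}
s_ppi, s_prot = cocitation_scores(db1, db2)
```

Here the two databases share two interactions out of three and two recorded, so SPPI = 2·2/(3+2) = 0.8, while they share three proteins out of four and three, so SProt = 2·3/(4+3) ≈ 0.86.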
Sorensen–Dice similarity scores are non-normally distributed and display a different variance within different groups of co-citations. Therefore, the statistical significance of differences in SPPI and SProt distributions in distinct co-citation groups was computed using the non-parametric two-sample Kolmogorov–Smirnov test for equality of continuous distributions (implemented in R, http://www.r-project.org). Confidence intervals for the mean values of SPPI and SProt were computed using Student’s t-distribution, for groups containing at least 50 co-citations.
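For readers who want to see what the two-sample Kolmogorov–Smirnov comparison involves, the sketch below computes the test statistic D (the largest gap between the two empirical distribution functions) and an asymptotic p-value from the Kolmogorov series. It is a pure-Python illustration, not the R routine used in the analysis; the asymptotic p-value is only an approximation and is unreliable for small or heavily tied samples. The function name and the `terms` parameter are hypothetical, and the confidence-interval calculation is not covered.

```python
import math

def ks_two_sample(x, y, terms=100):
    """Two-sample Kolmogorov-Smirnov statistic D with an asymptotic p-value."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the empirical CDFs.
    while i < n and j < m:
        value = min(x[i], y[j])
        while i < n and x[i] == value:
            i += 1
        while j < m and y[j] == value:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    # For very small lam the series converges slowly and the p-value is effectively 1.
    if lam < 0.2:
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return d, min(1.0, max(0.0, p))

# Made-up score samples for two groups of co-citations.
group_a = [i / 100 for i in range(50)]
group_b = [0.5 + i / 100 for i in range(50)]
d_stat, p_value = ks_two_sample(group_a, group_b)
```

A small p-value indicates that the two groups of scores are unlikely to have been drawn from the same distribution; the test makes no assumption of normality or equal variance, which is why it suits the score distributions described above.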
Using our criteria, two databases would disagree on an interaction if they chose different peptide sequences to represent the same protein. Recording protein identifiers that point to different splice variants of the same gene is an important example of such discrepancies. Therefore, a further consolidation step was added to the iRefIndex procedure, whereby all the proteins were mapped to the canonical UniProt isoforms (56) of the corresponding genes, whenever possible [http://irefindex.uio.no/wiki/Canonicalization and ref. (36)]. This mapping was performed mainly to further normalize the consolidated data set. The level of agreement across the PPI landscape was measured both before and after isoform normalization.
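The effect of isoform normalization on the agreement score can be illustrated with a toy example. The mapping table, the protein identifiers and the function names below are hypothetical; the actual canonicalization procedure is the one documented at the iRefIndex wiki page cited above.

```python
def dice(a, b):
    """Sorensen-Dice similarity of two non-empty sets."""
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical isoform-to-canonical table; identifiers are made up.
CANONICAL = {"P00001-2": "P00001", "P00002-3": "P00002"}

def canonicalize(ppis, mapping):
    """Replace each protein by its canonical isoform whenever a mapping exists."""
    return {frozenset(mapping.get(p, p) for p in ppi) for ppi in ppis}

# Two databases curate the same publication; one records a splice variant.
db1 = {frozenset({"P00001", "P00002"}), frozenset({"P00001", "P00003"})}
db2 = {frozenset({"P00001-2", "P00002"}), frozenset({"P00001", "P00003"})}

before = dice(db1, db2)
after = dice(canonicalize(db1, CANONICAL), canonicalize(db2, CANONICAL))
```

Before normalization the two databases agree on one of their two interactions each (score 0.5); after mapping the splice variant to its canonical isoform they agree fully (score 1.0). Proteins without an entry in the table are left unchanged, mirroring the 'whenever possible' qualification in the text.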
A.L.T. performed the systematic analysis of the consolidated data set of interactions. S.R. consolidated interaction data using the iRefIndex procedure. B.T. made the consolidated data amenable to analysis by implementing the iRefWeb database and query interface. I.M.D. supervised the iRefIndex project, performed initial analysis of the data, and provided illustrative test cases. S.J.W. supervised the iRefWeb project and the data-analysis project. A.L.T. and S.J.W. drafted the article.
Canadian Institutes of Health Research (MOP#82940); Canada Institute of Health Research (to S.J.W.); the Sickkids Foundation (to S.J.W.); Ontario Research Fund (to S.J.W.). Funding for open access charge: Canadian Institutes of Health Research (MOP#82940). S.J.W. is Canada Research Chair, Tier 1.
Conflict of interest. None declared.
Supplementary data are available at Database Online.
The authors wish to thank the teams of the BioGRID, DIP, HPRD, IntAct and MINT databases for their valuable comments on an earlier version of this article. Special acknowledgements are due to Sandra Orchard for helpful suggestions regarding iRefWeb features and to Gianni Cesareni for sharing a related manuscript prior to publication.